DSAN5600 Lab4: ARIMA and SARIMA Models

Author

Dr. Purna Gamage

ARIMA Model

Class Example

Step 1: Plotting the data

As with any data analysis, the first step is to create a time plot of the data to visually inspect for any irregularities or patterns.

We will use Job Postings on Indeed in the United States as our example dataset.

###### Load Libraries, Indeed Data, and Rename Columns ######
library(plotly)  # for plot_ly()
indeed_data <- read.csv("indeed.csv")
colnames(indeed_data) <- c("Date", "job_postings")

###### Plot Original Data using Plotly ######
plot_ly(indeed_data, x = ~as.Date(Date)) %>%
  add_lines(y = ~job_postings, name = 'Original Job Postings', line = list(color = 'blue')) %>%
  layout(title = "Original Job Postings",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Job Postings"))

The plot of job postings on Indeed in the United States from 2020 to 2025 reflects significant economic shifts over this period. A sharp decline in early 2020 corresponds to the onset of the COVID-19 pandemic, followed by a rapid recovery in 2021 as businesses adapted and reopened. Job postings peaked in early 2022, likely driven by post-pandemic economic optimism and increased hiring demand. However, from mid-2022 onward, there was a steady decline, suggesting a cooling labor market potentially influenced by inflation, rising interest rates, and broader economic tightening. By 2024, the decline stabilized, indicating that the job market may have reached a new equilibrium with more consistent, albeit cautious, hiring patterns.

Converting Job Postings to a Time Series Object

To proceed with modeling, it’s essential to convert the job postings into a time series object. This format allows us to apply time series analysis techniques effectively, ensuring that the temporal structure of the data is preserved.

Note: We are currently working with a univariate time series, focusing solely on the number of jobs posted on Indeed.com.

###### Convert to Time Series Object ######
library(lubridate)  # for decimal_date()
ts_indeed <- ts(indeed_data$job_postings, 
                start = decimal_date(as.Date(min(indeed_data$Date))), 
                frequency = 365.25)  # Daily data

Checking for Stationarity and Serial Correlation

Before fitting an ARIMA model, it’s crucial to determine if the series is stationary—meaning its statistical properties, like mean and variance, remain consistent over time.
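As a quick illustration of this idea (a minimal sketch with simulated data, not the lab's dataset), a white-noise series keeps roughly the same mean and variance over time while a random walk does not:

```r
set.seed(99)
wn <- rnorm(300)            # white noise: stationary, constant mean and variance
rw <- cumsum(rnorm(300))    # random walk: non-stationary, variance grows over time

# The random walk's overall variance dwarfs that of the white noise
var(wn)
var(rw)
```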

Step 2: Make the time series stationary

Let’s go back to our example.

# Plot the Autocorrelation Function (ACF) for Original Data
library(forecast)  # for ggAcf()
library(ggplot2)   # for theme_minimal()
ggAcf(ts_indeed, 50) +
  ggtitle("Autocorrelation Function of Original Job Postings") +
  theme_minimal()

To assess serial correlation, we use the Autocorrelation Function (ACF) plot. The ACF plot shows clear signs of high serial correlation, indicating that past values significantly influence future values. This indicates differencing is needed to achieve stationarity.

# Lag plot to visualize patterns in Original Data
gglagplot(ts_indeed, do.lines = FALSE) +
  ggtitle("Lag Plot of Original Job Postings") +
  theme_minimal()

Lag plots indicate the same high correlation that is visible in the ACF plot.

Augmented Dickey Fuller (ADF) Test

The null hypothesis of the ADF test is that the time series is non-stationary.

So, if the p-value of the test is less than the significance level (0.05), we reject the null hypothesis and infer that the time series is indeed stationary at the 5% significance level.

In our case, a p-value greater than 0.05 would indicate non-stationarity, in which case we go ahead with differencing the data to make the process stationary.

tseries::adf.test(ts_indeed)

    Augmented Dickey-Fuller Test

data:  ts_indeed
Dickey-Fuller = -1.1083, Lag order = 12, p-value = 0.9213
alternative hypothesis: stationary

The p-value (0.9213) is greater than the significance level of 0.05, indicating that the series is not stationary.

Differencing

The purpose of differencing is to transform the time series into a stationary one, where statistical properties remain constant over time.

###### Differencing the Data ######
diff_indeed <- diff(ts_indeed)

# Plot Differenced Original Data
p1 <- ggplot() +
  geom_line(aes(x = time(diff_indeed), y = diff_indeed), color = "blue") +
  labs(title = "Differenced Original Job Postings",
       x = "Time",
       y = "Differenced Job Postings") +
  theme_minimal()

# Display the Differenced Plot
p1

Order of Differencing

library(patchwork)

###### First and Second Order Differencing ######
# First-order differencing
diff_1 <- diff(ts_indeed)

# Second-order differencing
diff_2 <- diff(ts_indeed, differences = 2)

###### Plot ACF for First and Second Order Differenced Series ######
# ACF plot for first-order differenced series
acf_plot_1 <- ggAcf(diff_1,50) +
  ggtitle("ACF of First-Order Differenced Series") +
  theme_minimal()

# ACF plot for second-order differenced series
acf_plot_2 <- ggAcf(diff_2,50) +
  ggtitle("ACF of Second-Order Differenced Series") +
  theme_minimal()

# Display both ACF plots side by side
acf_plot_1 / acf_plot_2

The ACF plots provide valuable insights into the differencing required to achieve stationarity:

  1. First-Order Differencing:

The ACF of the first-order differenced series shows significant positive autocorrelations that persist across many lags. This indicates that first-order differencing is not sufficient to make the series stationary, as the correlations do not decay quickly to zero.

  2. Second-Order Differencing:

The ACF of the second-order differenced series shows a much better result. The autocorrelations drop close to zero rapidly, with only a few spikes within the confidence intervals. This suggests that the second-order differencing has effectively removed the trend, making the series closer to stationary.

While the second-order differencing (d = 2) seems to provide a more stationary series, we should be cautious not to over-difference. To comply with the principle of parsimony—which favors simpler models with fewer parameters—we can explore ARIMA models using both d = 1 and d = 2. Comparing model performance across different parameter combinations will help determine whether the additional differencing improves model accuracy or unnecessarily complicates the model.
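As a cross-check on the choice of d, the forecast package's ndiffs() estimates the number of differences required using a unit-root test. A minimal sketch with simulated data (not the Indeed series):

```r
library(forecast)

set.seed(123)
rw <- ts(cumsum(rnorm(500)))   # simulated random walk: one difference should suffice

ndiffs(rw)          # estimated d for the level series
ndiffs(diff(rw))    # estimated d after first differencing (typically 0)
```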

Step 3: Order of the AR term (p)?

The next step is to determine whether the model requires any AutoRegressive (AR) terms. This can be done by analyzing the Partial Autocorrelation Function (PACF) plot, which helps identify the appropriate number of AR terms to include in the model.

# Plot the Partial Autocorrelation Function (PACF) for Differenced Data
p1<-ggPacf(diff_1,50) +
  ggtitle("PACF of First Differenced Job Postings") +
  theme_minimal()

p2<-ggPacf(diff_2,50) +
  ggtitle("PACF of Second Differenced Job Postings") +
  theme_minimal()

p1/p2

Here you can observe that when d = 1, there are significant partial autocorrelations suggesting p = 0, 1, 2, or 3 (the remaining significant lags are too high-order to use as parameters). When d = 2, there are significant partial autocorrelations suggesting p = 0, 1, or 2. p = 0 is included as a choice here since the correlation is not dying down cleanly.
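The visual read-off of candidate p values can be made programmatic by comparing the partial autocorrelations against the approximate ±1.96/√n significance bound. A minimal sketch on a simulated AR(2) series (chosen so the true order is known), not the job-postings data:

```r
set.seed(42)
x <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 1000)  # true AR(2) process

pv     <- pacf(x, lag.max = 20, plot = FALSE)$acf   # partial autocorrelations, lags 1..20
thresh <- 1.96 / sqrt(length(x))                    # approximate 95% significance bound

which(abs(pv) > thresh)   # lags with significant partial autocorrelation
```

For a true AR(2) process, lags 1 and 2 should appear in that list, mirroring the visual cut-off we look for in the PACF plot.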

Step 4: Order of the MA term (q)

To determine the appropriate number of Moving Average (MA) terms for the model, we analyze the Autocorrelation Function (ACF) plot.

# Plot the Autocorrelation Function (ACF) for Differenced Data
p1<-ggAcf(diff_1,50) +
  ggtitle("ACF of First Differenced Job Postings") +
  theme_minimal()

p2<-ggAcf(diff_2,50) +
  ggtitle("ACF of Second Differenced Job Postings") +
  theme_minimal()

p1/p2

Here, it is clear that d = 2 provides better stationarity. Hence, at d = 2, the significant correlation at low lags suggests q = 1 (the remaining significant lags are too high-order to use as parameters).

Step 5: Parameter Tuning

Let’s go through different parameter combinations to find the best model.

# Load necessary libraries
library(knitr)
library(kableExtra)


# Define parameter ranges
p_range <- 0:4
d_range <- 1:2
q_range <- 1  # Single value for q

# Calculate total combinations
n_combinations <- length(p_range) * length(d_range) * length(q_range)

# Create an empty matrix to store results
results_matrix <- matrix(NA, nrow = n_combinations, ncol = 6)

# Initialize index for matrix row
i <- 1

# Loop through combinations of ARIMA model parameters
for (d in d_range) {
  for (p in p_range) {
    q <- q_range

    # Fit ARIMA model with specified (p,d,q)
    model <- Arima(ts_indeed, order = c(p, d, q), include.drift = TRUE)

    # Store model parameters and AIC/BIC/AICc values in matrix
    results_matrix[i, ] <- c(p, d, q, model$aic, model$bic, model$aicc)

    # Increment row index
    i <- i + 1
  }
}

# Convert matrix to data frame
results_df <- as.data.frame(results_matrix)
colnames(results_df) <- c("p", "d", "q", "AIC", "BIC", "AICc")

# Find the row with the lowest AIC
highlight_row <- which.min(results_df$AIC)

# Generate kable table with highlighting for the row with the lowest AIC
knitr::kable(results_df, align = 'c', caption = "Comparison of ARIMA Models") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  row_spec(highlight_row, bold = TRUE, background = "#FFFF99")  # Highlight row in yellow
Comparison of ARIMA Models
p d q AIC BIC AICc
0 1 1 -1444.427 -1427.897 -1444.414
1 1 1 -2889.317 -2867.278 -2889.295
2 1 1 -2888.179 -2860.630 -2888.146
3 1 1 -2888.354 -2855.294 -2888.308
4 1 1 -2886.284 -2847.715 -2886.222
0 2 1 -2837.687 -2826.668 -2837.680
1 2 1 -2835.773 -2819.245 -2835.759
2 2 1 -2879.716 -2857.679 -2879.694
3 2 1 -2888.634 -2861.088 -2888.601
4 2 1 -2832.020 -2798.964 -2831.973

Here, the table indicates that the ARIMA(1,1,1) model has the lowest AIC and BIC values.

However, let’s check with auto.arima():

auto.arima(ts_indeed)
Series: ts_indeed 
ARIMA(4,2,2) 

Coefficients:
          ar1      ar2      ar3      ar4     ma1     ma2
      -0.7639  -1.1123  -0.4509  -0.2923  0.4395  0.9794
s.e.   0.0227   0.0270   0.0268   0.0227  0.0047  0.0061

sigma^2 = 0.01021:  log likelihood = 1595.9
AIC=-3177.8   AICc=-3177.74   BIC=-3139.23

This suggests an ARIMA(4,2,2) model. Therefore, I will run the parameter search again, extending the grid to include q = 2.

# Load necessary libraries
library(knitr)
library(kableExtra)
library(forecast)

# Create an empty matrix to store results
results_matrix <- matrix(rep(NA, 6 * 20), nrow = 20)

# Initialize index for matrix row
i <- 1

# Loop through ARIMA model parameters: d = 1,2; p = 0:4; q = 1,2
for (d in 1:2) {
  for (p in 0:4) {
    for (q in 1:2) {
      
      # Ensure 'ts_indeed' is a univariate time series by selecting the correct column
      model <- Arima(ts_indeed, order = c(p, d, q), include.drift = TRUE)
      
      # Store model parameters and AIC/BIC/AICc values in matrix
      results_matrix[i, ] <- c(p, d, q, model$aic, model$bic, model$aicc)
      
      # Increment row index
      i <- i + 1
    }
  }
}

# Convert matrix to data frame
results_df <- as.data.frame(results_matrix)
colnames(results_df) <- c("p", "d", "q", "AIC", "BIC", "AICc")

# Find the row with the minimum AIC
highlight_row <- which.min(results_df$AIC)

# Generate kable table with highlighting for the row with the minimum AIC
knitr::kable(results_df, align = 'c', caption = "Comparison of ARIMA Models") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  row_spec(highlight_row, bold = TRUE, background = "#FFFF99")  # Highlight row in yellow
Comparison of ARIMA Models
p d q AIC BIC AICc
0 1 1 -1444.427 -1427.897 -1444.414
0 1 2 -2072.407 -2050.367 -2072.385
1 1 1 -2889.317 -2867.278 -2889.295
1 1 2 -2888.391 -2860.842 -2888.358
2 1 1 -2888.179 -2860.630 -2888.146
2 1 2 -2899.974 -2866.915 -2899.928
3 1 1 -2888.354 -2855.294 -2888.308
3 1 2 -2886.356 -2847.787 -2886.294
4 1 1 -2886.284 -2847.715 -2886.222
4 1 2 -3102.517 -3058.438 -3102.438
0 2 1 -2837.687 -2826.668 -2837.680
0 2 2 -2835.771 -2819.243 -2835.758
1 2 1 -2835.773 -2819.245 -2835.759
1 2 2 -2833.791 -2811.753 -2833.769
2 2 1 -2879.716 -2857.679 -2879.694
2 2 2 -2831.824 -2804.277 -2831.791
3 2 1 -2888.634 -2861.088 -2888.601
3 2 2 -2884.149 -2851.093 -2884.103
4 2 1 -2832.020 -2798.964 -2831.973
4 2 2 -3177.797 -3139.231 -3177.735

Indeed, the ARIMA(4,2,2) model has the lowest AIC and BIC values here; ARIMA(4,1,2) has the next-lowest AIC.

Step 6: Model Selection - Model Diagnostics

We have several choices for our models. Let’s check the model diagnostics to find the best model.

The Models chosen by the above analysis:

  • ARIMA(1,1,1)
  • ARIMA(4,1,2)
  • ARIMA(4,2,2)

ARIMA(1,1,1)

# Set seed for reproducibility
set.seed(345)

##### Capture ARIMA model output for diagnostics #####
library(astsa)  # for sarima()

# Example model diagnostics for ARIMA(1,1,1)
model_output <- capture.output(sarima(ts_indeed, 1, 1, 1))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
         Estimate     SE  t.value p.value
ar1        0.9371 0.0091 102.7188  0.0000
ma1       -0.3222 0.0242 -13.3017  0.0000
constant   0.0057 0.0274   0.2095  0.8341

sigma^2 estimated as 0.01196945 on 1823 degrees of freedom 
 
AIC = -1.58232  AICc = -1.582313  BIC = -1.570251 
 

The Residual Plot shows nearly consistent fluctuation around zero, suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.

The Autocorrelation Function (ACF) of the residuals reveals no significant autocorrelations except around lag 10, which is too high-order a lag to incorporate into the model.

The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.

The Ljung-Box Test p-values remain mostly below the 0.05 significance level, implying some significant autocorrelations are left in the residuals and suggesting that model improvements may be necessary.

Coefficient Significance: All model coefficients are significant for the ARIMA(1,1,1) model.
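The Ljung-Box p-values that sarima() plots can also be computed directly with base R's Box.test(). A minimal sketch on simulated series (assumption: white noise should pass the test, while a strongly autocorrelated AR(1) series should fail it):

```r
set.seed(1)
wn  <- rnorm(500)                                  # white-noise "residuals"
ar1 <- arima.sim(model = list(ar = 0.8), n = 500)  # clearly autocorrelated series

Box.test(wn,  lag = 20, type = "Ljung-Box")$p.value  # typically large: no autocorrelation detected
Box.test(ar1, lag = 20, type = "Ljung-Box")$p.value  # near zero: autocorrelation remains
```

When applying this to fitted-model residuals, pass fitdf = p + q so the test's degrees of freedom account for the estimated parameters.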

ARIMA(4,1,2)

# Set seed for reproducibility
set.seed(345)

##### Capture ARIMA(4,1,2) model output for diagnostics #####

# Fit ARIMA(4,1,2) model
model_output <- capture.output(sarima(ts_indeed, 4, 1, 2))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
         Estimate     SE  t.value p.value
ar1        0.2480 0.0235  10.5738  0.0000
ar2       -0.2489 0.0205 -12.1517  0.0000
ar3        0.5513 0.0211  26.1541  0.0000
ar4        0.1946 0.0232   8.3889  0.0000
ma1        0.4427 0.0045  98.1955  0.0000
ma2        0.9893 0.0042 236.8259  0.0000
constant   0.0056 0.0228   0.2432  0.8079

sigma^2 estimated as 0.01057986 on 1819 degrees of freedom 
 
AIC = -1.699078  AICc = -1.699045  BIC = -1.674939 
 

The Residual Plot shows nearly consistent fluctuation around zero, suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.

The Autocorrelation Function (ACF) of the residuals reveals significant autocorrelations around lag 2 and at higher-order lags, which are too high-order to incorporate into the model.

The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.

The Ljung-Box Test p-values remain below the 0.05 significance level, implying some significant autocorrelations are left in the residuals and suggesting that model improvements may be necessary.

Coefficient Significance: All model coefficients are significant for the ARIMA(4,1,2) model.

ARIMA(4,2,2)

# Set seed for reproducibility
set.seed(345)

##### Capture ARIMA(4,2,2) model output for diagnostics #####

# Fit ARIMA(4,2,2) model
model_output <- capture.output(sarima(ts_indeed, 4, 2, 2))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
    Estimate     SE  t.value p.value
ar1  -0.7639 0.0227 -33.7260       0
ar2  -1.1123 0.0270 -41.2599       0
ar3  -0.4509 0.0268 -16.8315       0
ar4  -0.2923 0.0227 -12.8834       0
ma1   0.4395 0.0047  92.8622       0
ma2   0.9794 0.0061 161.3321       0

sigma^2 estimated as 0.01016194 on 1819 degrees of freedom 
 
AIC = -1.741258  AICc = -1.741233  BIC = -1.720127 
 

The Residual Plot shows nearly consistent fluctuation around zero (better than the two models above), suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.

The Autocorrelation Function (ACF) of the residuals still shows some significant autocorrelations at high lags, which are too high-order to incorporate into the model.

The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.

The Ljung-Box Test p-values remain below the 0.05 significance level, implying some significant autocorrelations are left in the residuals and suggesting that model improvements may be necessary.

Coefficient Significance: All model coefficients are significant.

Conclusion:

Model diagnostics were similar across all the above models. Therefore, we will choose the model with the lowest AIC, which is ARIMA(4,2,2).

Model fitting

# Fit ARIMA(4,2,2) model
fit <- Arima(ts_indeed, order = c(4, 2, 2), include.drift = FALSE)

# Display model summary
summary(fit)
Series: ts_indeed 
ARIMA(4,2,2) 

Coefficients:
          ar1      ar2      ar3      ar4     ma1     ma2
      -0.7639  -1.1123  -0.4509  -0.2923  0.4395  0.9794
s.e.   0.0227   0.0270   0.0268   0.0227  0.0047  0.0061

sigma^2 = 0.01021:  log likelihood = 1595.9
AIC=-3177.8   AICc=-3177.74   BIC=-3139.23

Training set error measures:
                        ME      RMSE        MAE         MPE       MAPE
Training set -9.759566e-05 0.1008055 0.07215598 0.000640841 0.06057821
                    MASE        ACF1
Training set 0.002546013 0.007569145

Model Equation

A process \(x_t\) is said to be \(\operatorname{ARIMA}(p, d, q)\) if

\[ \phi(B) \nabla^d x_t = \theta(B) w_t, \quad \text{where } \nabla^d x_t = (1-B)^d x_t \]

For the fitted ARIMA(4,2,2) model;

\[ \phi(B)=1+0.76 B+1.11 B^2+ 0.45 B^3 + 0.29 B^4 \]

\[ \theta(B)=1+ 0.44 B+ 0.98 B^2 \]

Step 7: Model Forecasting

Forecasting is the final and most crucial step in time series modeling, as the primary goal of building these models is to generate accurate predictions of future values.

# Forecast the next 365 periods
forecast_result <- forecast(fit, h = 365)

# Display forecast accuracy
accuracy(forecast_result)
                        ME      RMSE        MAE         MPE       MAPE
Training set -9.759566e-05 0.1008055 0.07215598 0.000640841 0.06057821
                    MASE        ACF1
Training set 0.002546013 0.007569145
# Plot the forecast
autoplot(forecast_result) +
  labs(title = "ARIMA(4,2,2) Forecast",
       x = "Time",
       y = "Predicted Values") +
  theme_minimal()

The forecast for job postings over the next year (approximately 365 days) suggests a potential decline in the number of new postings. This downward trend could be attributed to seasonal hiring patterns, where many summer and fall positions are typically posted earlier in the year, particularly in January. Another possible explanation could be a broader market trend indicating a slowdown in job availability.

The confidence intervals shown in the shaded blue area represent the range within which the actual job postings are expected to fall. The widening of these intervals as we move further into the forecast period reflects increasing uncertainty. While the central forecast suggests a decline, the broader confidence range indicates that there is still some possibility for fluctuations—either a steeper decline or a stabilization in job postings. This uncertainty highlights the need to monitor the job market closely over the coming months to determine whether this trend is seasonal or indicative of a larger economic shift.

Benchmark Methods

For univariate time series forecasting, establishing benchmarks is straightforward. At a minimum, forecasts should be compared against a naive method and a standard approach like an ARIMA model to evaluate the effectiveness of new techniques.

Source

By comparing new forecasting models against these baseline methods, we can better understand whether the proposed techniques provide meaningful improvements.

Let’s go back to our example on the job postings at indeed.com.

# Fit models individually and check residuals
f_mean <- meanf(ts_indeed, h = 200)
head(f_mean$mean) #gives the forecasted values
Time Series:
Start = 2025.08675284164 
End = 2025.10044209558 
Frequency = 365.25 
[1] 123.8298 123.8298 123.8298 123.8298 123.8298 123.8298
mean(ts_indeed)
[1] 123.8298
f_naive <- naive(ts_indeed, h = 200)
#checkresiduals(f_naive)

f_drift <- rwf(ts_indeed, drift = TRUE, h = 200)
checkresiduals(f_drift)


    Ljung-Box test

data:  Residuals from Random walk with drift
Q* = 19674, df = 365, p-value < 2.2e-16

Model df: 0.   Total lags used: 365
# Plot forecasts using Mean, Naïve, Drift methods, and ARIMA fit
autoplot(ts_indeed) +
  autolayer(meanf(ts_indeed, h = 365), series = "Mean", PI = FALSE) +
  autolayer(naive(ts_indeed, h = 365), series = "Naïve", PI = FALSE) +
  autolayer(rwf(ts_indeed, drift = TRUE, h = 365), series = "Drift", PI = FALSE) +
  autolayer(forecast(fit, h = 365), series = "ARIMA Fit", PI = FALSE) +
  ggtitle("Job Postings Forecast") +
  xlab("Date") + ylab("Number of Job Postings") +
  guides(colour = guide_legend(title = "Forecast Methods")) +
  theme_minimal()

In Plotly, so we can zoom in on the forecasts:

# Generate forecasts
mean_forecast <- meanf(ts_indeed, h = 365)
naive_forecast <- naive(ts_indeed, h = 365)
drift_forecast <- rwf(ts_indeed, drift = TRUE, h = 365)
arima_forecast <- forecast(fit, h = 365)

# Convert forecasts to data frames
mean_df <- data.frame(Date = time(mean_forecast$mean), Mean = as.numeric(mean_forecast$mean))
naive_df <- data.frame(Date = time(naive_forecast$mean), Naive = as.numeric(naive_forecast$mean))
drift_df <- data.frame(Date = time(drift_forecast$mean), Drift = as.numeric(drift_forecast$mean))
arima_df <- data.frame(Date = time(arima_forecast$mean), ARIMA_Fit = as.numeric(arima_forecast$mean))

# Original data
ts_indeed_df <- data.frame(Date = time(ts_indeed), Job_Postings = as.numeric(ts_indeed))

# Create Plotly plot
plot_ly() %>%
  add_lines(data = ts_indeed_df, x = ~Date, y = ~Job_Postings, name = 'Original Data', line = list(color = 'black')) %>%
  add_lines(data = mean_df, x = ~Date, y = ~Mean, name = 'Mean Forecast', line = list(color = 'blue')) %>%
  add_lines(data = naive_df, x = ~Date, y = ~Naive, name = 'Naïve Forecast', line = list(color = 'red')) %>%
  add_lines(data = drift_df, x = ~Date, y = ~Drift, name = 'Drift Forecast', line = list(color = 'green')) %>%
  add_lines(data = arima_df, x = ~Date, y = ~ARIMA_Fit, name = 'ARIMA Fit', line = list(color = 'purple')) %>%
  layout(title = 'Job Postings Forecast',
         xaxis = list(title = 'Date'),
         yaxis = list(title = 'Number of Job Postings'),
         legend = list(title = list(text = 'Forecast Methods')))

By zooming in, it is very clear that the ARIMA model performs better than the basic benchmark methods.

fpp3 package

The fpp3 package has a new tool for fitting models automatically, similar to auto.arima(), but the model-selection procedure in fpp3 is different.

# Load necessary libraries
library(tsibble)
library(fable)
library(feasts)
library(dplyr)

# Ensure 'Date' is in date format
indeed_data$Date <- as.Date(indeed_data$Date)

# Convert to tsibble
ts_indeed2 <- indeed_data %>%
  as_tsibble(index = Date)

# Fit ARIMA model
fit <- ts_indeed2 %>%
  model(ARIMA(job_postings))

# Display model summary
report(fit)
Series: job_postings 
Model: ARIMA(2,2,2)(1,0,1)[7] 

Coefficients:
         ar1      ar2      ma1     ma2     sar1     sma1
      1.2927  -0.3257  -1.5437  0.6053  -0.0245  -0.9126
s.e.  0.0792   0.0781   0.0679  0.0660   0.0291   0.0116

sigma^2 estimated as 0.007068:  log likelihood=1927.94
AIC=-3841.87   AICc=-3841.81   BIC=-3803.31

This is a seasonal model. It could be a better fit, considering the data are daily with a weekly (7-day) cycle. We can revisit this in the next chapter.

Check the model diagnostics:

# Set seed for reproducibility
set.seed(345)

##### Capture ARIMA(2,2,2)(1,0,1)[7] model output for diagnostics #####

# Fit seasonal ARIMA model
model_output <- capture.output(sarima(ts_indeed, 2, 2, 2, 1, 0, 1, 7))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
     Estimate     SE  t.value p.value
ar1    1.2927 0.0792  16.3193  0.0000
ar2   -0.3257 0.0781  -4.1686  0.0000
ma1   -1.5437 0.0679 -22.7379  0.0000
ma2    0.6053 0.0660   9.1692  0.0000
sar1  -0.0245 0.0291  -0.8422  0.3998
sma1  -0.9126 0.0116 -78.4846  0.0000

sigma^2 estimated as 0.007033595 on 1819 degrees of freedom 
 
AIC = -2.105135  AICc = -2.105109  BIC = -2.084003 
 

Still, it doesn’t make a big difference, although the seasonal ar1 coefficient is not significant here. However, this model has a lower AIC value than the ARIMA(4,2,2) model.

# Load necessary library
library(forecast)

# Fit ARIMA(2,2,2)(1,0,1)[7] model
fit <- Arima(ts_indeed, order = c(2, 2, 2), seasonal = list(order = c(1, 0, 1), period = 7), include.drift = FALSE)

# Display model summary
summary(fit)
Series: ts_indeed 
ARIMA(2,2,2)(1,0,1)[7] 

Coefficients:
         ar1      ar2      ma1     ma2     sar1     sma1
      1.2927  -0.3257  -1.5437  0.6053  -0.0245  -0.9126
s.e.  0.0792   0.0781   0.0679  0.0660   0.0291   0.0116

sigma^2 = 0.007068:  log likelihood = 1927.94
AIC=-3841.87   AICc=-3841.81   BIC=-3803.31

Training set error measures:
                       ME       RMSE        MAE         MPE       MAPE
Training set 0.0001746875 0.08388585 0.05719408 0.001490598 0.04855488
                    MASE        ACF1
Training set 0.002018085 -0.01442729
# Generate the forecast
forecast_result <- forecast(fit, h = 365)

# Plot the forecast
autoplot(forecast_result) +
  labs(title = "ARIMA(2,2,2)(1,0,1)[7] Forecast",
       x = "Time",
       y = "Predicted Job Postings") +
  theme_minimal()

Prophet Model

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

Prophet is open source software released by Facebook’s Core Data Science team. It is available for download on CRAN and PyPI.

source

In R

Read More 1

Read More 2

# Load required libraries
library(prophet)


# Prepare the dataset for Prophet
job_postings_data <- indeed_data %>% 
  select(Date, job_postings) %>% 
  rename(ds = Date, y = job_postings)

# Initialize and fit the Prophet model
prophet_model <- prophet(job_postings_data)

# Extend the forecast horizon by 365 days
future_dates <- make_future_dataframe(prophet_model, periods = 365)

# Generate future predictions
predictions <- predict(prophet_model, future_dates)

# Visualize the forecast
plot(prophet_model, predictions)

This forecast looks better than the simple ARIMA or SARIMA models. We will discuss more examples later.

Financial Time Series: Apple Stock prices

We will use Apple stock prices as our example dataset.

###### Load Apple Stock Data from Yahoo Finance ######
library(quantmod)  # for getSymbols()
AAPL <- getSymbols("AAPL", auto.assign = FALSE, from = "1980-01-01", src = "yahoo")

###### Convert to Data Frame and Add Date Column ######
AAPL <- data.frame(AAPL)
AAPL <- data.frame(AAPL, rownames(AAPL))

###### Rename the Date Column ######
colnames(AAPL)[7] <- "date"

###### Convert Date Column to Date Format ######
AAPL$date <- as.Date(AAPL$date, "%Y-%m-%d")

###### Create Data Frame with Original and Log-Transformed Adjusted Close Prices ######
data <- data.frame(AAPL$date, AAPL$AAPL.Adjusted, log(AAPL$AAPL.Adjusted))
colnames(data) <- c("date", "AAPL", "log.AAPL")

###### Plot Original and Log-Transformed Adjusted Close Prices ######
ggplot(data) +
  geom_line(aes(x = date, y = AAPL, color = "Adjusted Close")) +
  geom_line(aes(x = date, y = log.AAPL, color = "Log Adjusted Close")) +
  labs(title = "Apple Stock Prices (Adjusted & Log-Transformed)",
       x = "Date",
       y = "Price",
       color = "Legend") +
  theme_minimal()

AAPL is the ticker symbol for Apple Inc.’s stock, and we have retrieved its historical prices from Yahoo Finance.

Upon plotting the data, we observe heteroscedasticity, where the variability increases over time. This non-constant variance can negatively impact model accuracy. To mitigate this, we apply a log transformation, which helps stabilize the variance.

The log-transformed data reduces heteroscedasticity, providing a more consistent trend and making it more suitable for reliable model fitting.
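Note that once we difference the log series later in this section, we are computing log returns, the standard transform for financial series. A minimal base-R sketch with toy prices (not actual AAPL values):

```r
prices <- c(100, 102, 101, 105)       # toy price series

log_returns <- diff(log(prices))      # difference of the log series

# Identical to log(p_t / p_{t-1}), the continuously compounded return
all.equal(log_returns, log(prices[-1] / prices[-length(prices)]))
```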

###### Convert to Time Series Object ######
ts_AAPL <- ts(data$AAPL, 
              start = decimal_date(as.Date("1980-12-12")), 
              frequency = 252)  # Approx. number of trading days in a year

# Plot the Autocorrelation Function (ACF)
ggAcf(ts_AAPL,50) +
  ggtitle("Autocorrelation Function of Apple Stock Prices") +
  theme_minimal()

To assess serial correlation, we use the Autocorrelation Function (ACF) plot. The ACF plot shows clear signs of high serial correlation, indicating that past values significantly influence future values. This indicates differencing is needed to achieve stationarity.

# Lag plot to visualize patterns at different lags
gglagplot(ts_AAPL, do.lines = FALSE) +
  ggtitle("Lag Plot of Apple Stock Prices") +
  theme_minimal()

# Plot the Time Series with Moving Averages
autoplot(ts_AAPL, series = "Apple Adjusted Close") +
  autolayer(ma(ts_AAPL, 200), series = "200-Day Moving Average") +
  autolayer(ma(ts_AAPL, 500), series = "500-Day Moving Average") +
  labs(title = "Apple Stock Prices with Moving Averages",
       x = "Year",
       y = "Adjusted Close Price",
       color = "Legend") +
  theme_minimal()

Applying a high moving average (MA) helps smooth out short-term fluctuations, making long-term trends and patterns more visible. In this case, the moving average highlights a substantial increase in Apple’s stock prices, particularly after the year 2000.
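Under the hood, a centered k-point moving average is a symmetric convolution; forecast::ma() behaves much like base R's stats::filter() with equal weights. A minimal sketch on toy data:

```r
x <- ts(1:10)
k <- 3

# Centered 3-point moving average; the endpoints become NA
ma3 <- stats::filter(x, rep(1 / k, k), sides = 2)
ma3
```

On this linear toy series the interior values reproduce the series itself (each point equals the mean of its neighbors), which makes the endpoint NA behavior easy to see.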

Log transformation

###### Load Required Libraries ######

library(patchwork)


###### Create Data Frame with Original and Log-Transformed Adjusted Close Prices ######
data <- data.frame(AAPL$date, AAPL$AAPL.Adjusted, log(AAPL$AAPL.Adjusted))
colnames(data) <- c("date", "AAPL", "log.AAPL")

###### Convert to Time Series Object ######
ts_AAPL <- ts(data$AAPL, 
              start = decimal_date(as.Date(min(data$date))), 
              frequency = 252)  # Approx. number of trading days in a year

ts_log_AAPL <- ts(data$log.AAPL, 
                  start = decimal_date(as.Date(min(data$date))), 
                  frequency = 252)

###### Differencing the Data ######
diff_AAPL <- diff(ts_AAPL)
diff_log_AAPL <- diff(ts_log_AAPL)

# Create plots for differenced data
p1 <- ggplot() +
  geom_line(aes(x = time(diff_AAPL), y = diff_AAPL), color = "blue") +
  labs(title = "Differenced Apple Stock Prices",
       x = "Time",
       y = "Differenced Prices") +
  theme_minimal()

p2 <- ggplot() +
  geom_line(aes(x = time(diff_log_AAPL), y = diff_log_AAPL), color = "red") +
  labs(title = "Differenced Log-Transformed Apple Stock Prices",
       x = "Time",
       y = "Differenced Log Prices") +
  theme_minimal()

# Combine plots using patchwork
p1 / p2

It is clear that the data need to be transformed before fitting a model.
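The effect of the log transform is easy to see on a toy example: differences of logs are approximate percentage returns, so a series growing at a constant rate becomes constant after log-differencing. A base-R sketch with made-up prices:

```r
# Prices growing at exactly 10% per period (illustrative, not real data)
prices <- 100 * 1.1^(0:4)   # 100, 110, 121, 133.1, 146.41

diff(prices)        # raw differences grow with the level: 10, 11, 12.1, 13.31
diff(log(prices))   # log differences are constant: log(1.1) each period
```

This is why the differenced log series above has a much more stable variance than the differenced raw series.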

Differencing - parameter selection

###### Differencing the Log-Transformed Data ######
diff_log_AAPL <- diff(ts_log_AAPL)
diff2_log_AAPL <- diff(ts_log_AAPL, differences = 2)

# ACF Plots for Differenced Log-Transformed Data
acf_plot1 <- ggAcf(diff_log_AAPL) +
  labs(title = "ACF of First Differenced Log-Transformed Apple Stock Prices") +
  theme_minimal()

acf_plot2 <- ggAcf(diff2_log_AAPL) +
  labs(title = "ACF of Second Differenced Log-Transformed Apple Stock Prices") +
  theme_minimal()

# Combine ACF Plots using patchwork
acf_plot1 / acf_plot2

First order differencing is sufficient.

###### ACF and PACF Plots for First Differenced Log-Transformed Data ######
acf_plot <- ggAcf(diff_log_AAPL) +
  labs(title = "ACF of First Differenced Log-Transformed Apple Stock Prices") +
  theme_minimal()

pacf_plot <- ggPacf(diff_log_AAPL) +
  labs(title = "PACF of First Differenced Log-Transformed Apple Stock Prices") +
  theme_minimal()

# Combine ACF and PACF Plots using patchwork
acf_plot / pacf_plot

Candidate parameters: d = 1; p = 2:4; q = 2:4.

###### Load Necessary Libraries ######
library(knitr)
library(kableExtra)
library(forecast)

###### Create an Empty Matrix to Store Results ######
results_matrix <- matrix(rep(NA, 6 * 9), nrow = 9)

###### Initialize Index for Matrix Row ######
i <- 1

###### Loop through ARIMA Model Parameters: d = 1; p = 2:4; q = 2:4 ######
for (p in 2:4) {
  for (q in 2:4) {
    
    # Fit ARIMA model on the log series (d = 1 performs the differencing internally)
    model <- Arima(ts_log_AAPL, order = c(p, 1, q), include.drift = TRUE)
    
    # Store model parameters and AIC/BIC/AICc values in matrix
    results_matrix[i, ] <- c(p, 1, q, model$aic, model$bic, model$aicc)
    
    # Increment row index
    i <- i + 1
  }
}

###### Convert Matrix to Data Frame ######
results_df <- as.data.frame(results_matrix)
colnames(results_df) <- c("p", "d", "q", "AIC", "BIC", "AICc")

###### Find the Row with the Minimum AIC ######
highlight_row <- which.min(results_df$AIC)

###### Generate kable Table with Highlighting for the Row with the Minimum AIC ######
knitr::kable(results_df, align = 'c', caption = "Comparison of ARIMA Models") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  row_spec(highlight_row, bold = TRUE, background = "#FFFF99")  # Highlight row in yellow
Comparison of ARIMA Models
p d q AIC BIC AICc
2 1 2 -47840.78 -47796.87 -47840.77
2 1 3 -47849.91 -47798.69 -47849.90
2 1 4 -47856.66 -47798.13 -47856.65
3 1 2 -47857.65 -47806.43 -47857.64
3 1 3 -47856.15 -47797.61 -47856.13
3 1 4 -47855.38 -47789.52 -47855.36
4 1 2 -47852.38 -47793.85 -47852.37
4 1 3 -47854.87 -47789.01 -47854.85
4 1 4 -47839.01 -47765.84 -47838.99

ARIMA(3,1,2) is a good model.
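The AIC comparison in the table can be reproduced with base R's `stats::arima()`, which `forecast::Arima()` wraps. A minimal sketch on the built-in `lh` series rather than the Apple data:

```r
# Compare two candidate AR orders on the built-in 'lh' series by AIC
fit1 <- arima(lh, order = c(1, 0, 0))
fit2 <- arima(lh, order = c(3, 0, 0))

# AIC = -2 * logLik + 2 * (number of estimated parameters);
# the smaller value identifies the preferred candidate
c(AIC(fit1), AIC(fit2))
```

Because AIC penalizes each extra parameter, nearby orders often score similarly, which is why the table above is so flat around the minimum.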

Let’s check auto.arima()

auto.arima(ts_log_AAPL)
Series: ts_log_AAPL 
ARIMA(4,1,0)(1,0,0)[12] with drift 

Coefficients:
         ar1      ar2      ar3     ar4    sar1  drift
      0.0159  -0.0192  -0.0303  0.0271  0.0374  7e-04
s.e.  0.0095   0.0095   0.0095  0.0095  0.0095  3e-04

sigma^2 = 0.0007917:  log likelihood = 23951.77
AIC=-47889.54   AICc=-47889.53   BIC=-47838.32

This suggests ARIMA(4,1,0)(1,0,0)[12] with drift.

Let’s check fpp3 package:

# Load necessary libraries
library(tsibble)
library(fable)
library(feasts)
library(dplyr)

###### Prepare the Log-Transformed Apple Data ######
# Ensure 'Date' is in date format
AAPL$date <- as.Date(AAPL$date)

# Create a data frame with log-transformed adjusted close prices
apple_data <- AAPL %>% 
  select(date, AAPL.Adjusted) %>% 
  mutate(log_AAPL = log(AAPL.Adjusted))

# Convert to tsibble and fill implicit gaps
ts_apple <- apple_data %>% 
  as_tsibble(index = date) %>%
  fill_gaps()

###### Fit ARIMA Model on Log-Transformed Data ######
fit <- ts_apple %>% 
  model(ARIMA(log_AAPL))

###### Display Model Summary ######
report(fit)
Series: log_AAPL 
Model: ARIMA(0,0,0) w/ mean 

Coefficients:
      constant
        0.5543
s.e.    0.0241

sigma^2 estimated as 4.446:  log likelihood=-26163.24
AIC=52330.47   AICc=52330.47   BIC=52345.85

This selects a constant-only model (ARIMA(0,0,0) with mean), which is clearly not a useful model for this series.

Model Comparison

ARIMA(3,1,2)

# Set seed for reproducibility
set.seed(345)

##### Capture ARIMA(3,1,2) model output for diagnostics #####

# Fit ARIMA(3,1,2) model
model_output <- capture.output(sarima(ts_log_AAPL, 3, 1, 2))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
         Estimate     SE t.value p.value
ar1       -0.4991 0.2213 -2.2555  0.0241
ar2       -0.1500 0.1616 -0.9283  0.3533
ar3       -0.0408 0.0101 -4.0203  0.0001
ma1        0.5154 0.2214  2.3281  0.0199
ma2        0.1381 0.1622  0.8514  0.3946
constant   0.0007 0.0003  2.6360  0.0084

sigma^2 estimated as 0.0007924691 on 11124 degrees of freedom 
 
AIC = -4.301222  AICc = -4.301221  BIC = -4.29662 
 

The Residual Plot shows fairly consistent fluctuation around zero, but we see some volatility clustering, suggesting that financial volatility models would be better suited to this data.

The Autocorrelation Function (ACF) of the residuals shows that they are mostly independent.

The Q-Q Plot indicates that the residuals follow a roughly normal distribution, with some deviations at the tails, which suggests the model could be improved.

The Ljung-Box Test p-values remain mostly below the 0.05 significance level, implying that some significant autocorrelation remains in the residuals and that model improvements may be necessary.

Coefficient Significance: ar2 and ma2 coefficients are not significant.
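The Ljung-Box p-values shown in the `sarima()` diagnostics can be reproduced with base R's `Box.test()`, adjusting the degrees of freedom for the number of estimated ARMA coefficients (`fitdf = p + q`). A sketch on the built-in `lh` series rather than the Apple data:

```r
# Fit a small AR model and test its residuals for leftover autocorrelation
fit <- arima(lh, order = c(1, 0, 0))

# Ljung-Box test at lag 10; fitdf = p + q = 1 estimated ARMA coefficient
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
lb$p.value   # a large p-value means no evidence of remaining autocorrelation
```

A p-value below 0.05 at several lags, as seen in the Apple diagnostics, is the signal that structure remains in the residuals.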

ARIMA(4,0,0)(1,0,0)[12]

###### Set Seed for Reproducibility ######
set.seed(345)

###### Capture ARIMA Model Output for Diagnostics ######

# Model diagnostics for ARIMA(4,0,0)x(1,0,0)[12]
model_output <- capture.output(sarima(ts_log_AAPL, p = 4, d = 0, q = 0, P = 1, D = 0, Q = 0, S = 12))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")

This produces the error message below: without ordinary differencing (d = 0) the AR part is non-stationary, so this specification is not a good fit for the data:

“Error in arima(xdata, order = c(p, d, q), seasonal = list(order = c(P, : non-stationary AR part from CSS”

Model Forecast

So far, ARIMA(3,1,2) is the better model.

fit <- Arima(ts_log_AAPL, order = c(3, 1, 2), include.drift = FALSE)

# Display model summary
summary(fit)
Series: ts_log_AAPL 
ARIMA(3,1,2) 

Coefficients:
          ar1      ar2      ar3     ma1     ma2
      -0.5074  -0.1566  -0.0401  0.5244  0.1456
s.e.   0.2193   0.1608   0.0102  0.2193  0.1617

sigma^2 = 0.0007933:  log likelihood = 23939.76
AIC=-47867.53   AICc=-47867.52   BIC=-47823.62

Training set error measures:
                       ME       RMSE       MAE         MPE     MAPE     MASE
Training set 0.0007096621 0.02815851 0.0190339 -0.08795954 2.423098 0.270375
                      ACF1
Training set -0.0004307023

Model Equation

A process \(x_t\) is said to be \(\operatorname{ARIMA}(p, d, q)\) if the differenced series \(\nabla^d x_t=(1-B)^d x_t\) is \(\operatorname{ARMA}(p, q)\), that is,

\[ \phi(B) \nabla^d x_t = \theta(B) w_t \]

where, for the fitted model,

\[ \phi(B)=1+0.51 B+0.16 B^2+ 0.04 B^3 \]

\[ \theta(B)=1+ 0.52 B+ 0.15 B^2 \]

Model Forecast

# Forecast the next 365 periods
forecast_result <- forecast(fit, h = 365)

# Display forecast accuracy
accuracy(forecast_result)
                       ME       RMSE       MAE         MPE     MAPE     MASE
Training set 0.0007096621 0.02815851 0.0190339 -0.08795954 2.423098 0.270375
                      ACF1
Training set -0.0004307023
# Plot the forecast
autoplot(forecast_result) +
  labs(title = "ARIMA(3,1,2) Forecast",
       x = "Time",
       y = "Predicted Values") +
  theme_minimal()

###### Plot Forecasts Using Mean, Naïve, Drift Methods, and ARIMA Fit ######
autoplot(ts_log_AAPL) +
  autolayer(meanf(ts_log_AAPL, h = 365), series = "Mean", PI = FALSE) +
  autolayer(naive(ts_log_AAPL, h = 365), series = "Naïve", PI = FALSE) +
  autolayer(rwf(ts_log_AAPL, drift = TRUE, h = 365), series = "Drift", PI = FALSE) +
  autolayer(forecast(fit, h = 365), series = "ARIMA Fit", PI = FALSE) +
  ggtitle("Apple Stock Price Forecast (Log-Transformed)") +
  xlab("Date") + ylab("Log Adjusted Close Price") +
  guides(colour = guide_legend(title = "Forecast Methods")) +
  theme_minimal()

It looks like the drift model captures the trend better than the ARIMA model, which means this ARIMA model is not sufficient. That is expected: this is financial data, so an ARCH/GARCH model should be fitted instead.
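A quick check for the volatility clustering that motivates ARCH/GARCH is to apply the Ljung-Box test to the squared residuals (the McLeod-Li idea): if the squares are autocorrelated even though the levels are not, a GARCH-type model is warranted. A base-R sketch on a simulated ARCH(1) series, not the Apple residuals:

```r
# Simulate a toy ARCH(1) series, then test its *squares* for autocorrelation
set.seed(1)
n <- 500
e <- numeric(n); e[1] <- rnorm(1)
for (t in 2:n) e[t] <- rnorm(1) * sqrt(0.2 + 0.7 * e[t - 1]^2)

Box.test(e,   lag = 10, type = "Ljung-Box")   # levels: typically little correlation
Box.test(e^2, lag = 10, type = "Ljung-Box")   # squares: strongly correlated
```

Applying the same two tests to the ARIMA(3,1,2) residuals would formalize the clustering visible in the residual plot.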

SARIMA Model

\[ARIMA(p,d,q)X(P,D,Q)_s\]
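In full backshift notation, the multiplicative seasonal model combines the ordinary and seasonal polynomials:

\[ \Phi_P(B^s)\,\phi_p(B)\,(1-B)^d\,(1-B^s)^D x_t = \Theta_Q(B^s)\,\theta_q(B)\, w_t \]

where \(s\) is the seasonal period (here \(s = 12\) for monthly data).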

Class Example

We will use Egg Prices as our example dataset.

###### Load Egg Price Data and Rename Columns ######
eggs_data <- read.csv("eggs_price_2025.csv")
colnames(eggs_data) <- c("Date", "egg_prices")

###### Plot Original Data using Plotly ######
plot_ly(eggs_data, x = ~as.Date(Date), y = ~egg_prices, type = 'scatter', mode = 'lines',
        line = list(color = 'blue')) %>%
  layout(title = "Original Egg Prices",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Egg Prices"))

Upon plotting the data, we observe heteroscedasticity, where the variability increases over time. This non-constant variance can negatively impact model accuracy. To mitigate this, we apply a log transformation, which helps stabilize the variance.

The log-transformed data reduces heteroscedasticity, providing a more consistent trend and making it more suitable for reliable model fitting.

To confirm, let’s compare the logged-transformed data and the differenced data.

###### Plot Original and Log-Transformed Data using Plotly ######
eggs_data$log_egg_prices <- log(eggs_data$egg_prices)

plot_ly(eggs_data, x = ~as.Date(Date)) %>%
  add_lines(y = ~egg_prices, name = 'Original Prices', line = list(color = 'blue')) %>%
  add_lines(y = ~log_egg_prices, name = 'Log-Transformed Prices', line = list(color = 'red')) %>%
  layout(title = "Original vs Log-Transformed Egg Prices",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Price"))

Note: We are currently working with a univariate time series, focusing solely on egg prices.

library(patchwork)
###### Convert to Time Series Object ######
ts_eggs <- ts(eggs_data$egg_prices, 
              start = decimal_date(as.Date(min(eggs_data$Date))), 
              frequency = 12)  # Assuming monthly data

ts_log_eggs <- ts(eggs_data$log_egg_prices, 
                  start = decimal_date(as.Date(min(eggs_data$Date))), 
                  frequency = 12)

###### Differencing the Data ######
diff_eggs <- diff(ts_eggs)
diff_log_eggs <- diff(ts_log_eggs)

# Create plots for differenced data
p1 <- ggplot() +
  geom_line(aes(x = time(diff_eggs), y = diff_eggs), color = "blue") +
  labs(title = "Differenced Egg Prices",
       x = "Time",
       y = "Differenced Prices") +
  theme_minimal()

p2 <- ggplot() +
  geom_line(aes(x = time(diff_log_eggs), y = diff_log_eggs), color = "red") +
  labs(title = "Differenced Log-Transformed Egg Prices",
       x = "Time",
       y = "Differenced Log Prices") +
  theme_minimal()

# Combine plots using patchwork
p1 / p2

It is clear from the plots above that the log-transformed data should be used.

Seasonality

Check the seasonality:

###### Decompose the Time Series and Plot ######
autoplot(decompose(ts_log_eggs)) +
  ggtitle("Decomposition of Log Transformed Egg Prices") +
  xlab("Date") +
  ylab("Value")

There is a clear seasonal component here.

Moreover, check the lag plots for seasonal correlation at lags 12, 24, 36, 48.

###### Plot Lag Plot using gglagplot ######
gglagplot(ts_log_eggs, do.lines = FALSE, set.lags = c(12, 24, 36, 48)) +
  ggtitle("Lag Plot of Log Transformed Egg Prices")

There appears to be correlation at seasonal lags 1 and 2.

###### Plot ACF of Log Transformed Egg Prices using ggAcf ######
ggAcf(ts_log_eggs,50) +
  ggtitle("ACF of Log Transformed Egg Prices")

However, the above ACF plot makes it a little hard to identify seasonal correlations. This difficulty may stem from the strong autocorrelation caused by the underlying trend in the data.

Differencing the data

Ordinary Differencing

###### First and Second Order Differencing for Log Egg Prices ######
# First-order differencing
diff_log_1 <- diff(ts_log_eggs)

# Second-order differencing
diff_log_2 <- diff(ts_log_eggs, differences = 2)

###### Plot ACF for First and Second Order Differenced Log Series ######
# ACF plot for first-order differenced log series
acf_log_plot_1 <- ggAcf(diff_log_1, 50) +
  ggtitle("ACF of First-Order Differenced Log Series") +
  theme_minimal()

# ACF plot for second-order differenced log series
acf_log_plot_2 <- ggAcf(diff_log_2, 50) +
  ggtitle("ACF of Second-Order Differenced Log Series") +
  theme_minimal()

# Display both ACF plots side by side
acf_log_plot_1 / acf_log_plot_2

The above plots clearly indicate that first-order differencing is sufficient, while second-order differencing results in over-differencing. Additionally, strong seasonal correlations are evident at seasonal lags of 12, 24, 36, and so on.
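The over-differencing effect can be seen numerically: differencing more than needed inflates the variance (a random walk needs one difference; a second turns the noise into a negatively autocorrelated MA(1) with roughly twice the variance). A base-R sketch on simulated data:

```r
# Over-differencing inflates variance: one difference turns a random walk
# into white noise; a second difference roughly doubles the variance
set.seed(42)
x  <- cumsum(rnorm(500))                # random walk: d = 1 is enough
v1 <- var(diff(x))                      # close to 1
v2 <- var(diff(x, differences = 2))     # close to 2: over-differenced
c(v1, v2)
```

The large negative spike at lag 1 in the second-differenced ACF above is the same MA(1) signature.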

Seasonal Differencing

###### Apply Only Seasonal Differencing (Lag = 12) ######
seasonal_only_diff <- diff(ts_log_eggs, lag = 12)

###### Plot ACF for the Only Seasonally Differenced Series ######
ggAcf(seasonal_only_diff) +
  ggtitle("ACF of Only Seasonally Differenced Log Egg Prices") +
  theme_minimal()

Seasonal differencing alone is insufficient, as significant autocorrelation from the underlying trend components remains in the series.

Both Ordinary Differencing and Seasonal differencing

###### Apply First-Order Differencing ######
first_diff <- diff(ts_log_eggs)

###### Apply Seasonal Differencing (Lag = 12) on Top of First-Order Differencing ######
seasonal_diff <- diff(first_diff, lag = 12)

###### Plot ACF and PACF for the Differenced Series ######
ggAcf(seasonal_diff) +
  ggtitle("ACF of Seasonally Differenced Log Egg Prices") +
  theme_minimal()

This series appears closer to weak stationarity. The ACF plot shows clear autocorrelations at lags q = 0, 2, 4 and seasonal lag Q = 1.

Additionally, both first-order differencing (d = 1) and seasonal differencing (D = 1) have been applied to the data.

Let’s check the PACF plot to check the AR parameters.

ggPacf(seasonal_diff) +
  ggtitle("PACF of Seasonally Differenced Log Egg Prices") +
  theme_minimal()

The PACF plot shows clear autocorrelations at lags p = 0,2,3 and P = 0,1,2.

Model Diagnostics

We have several choices for our models. Let’s check the model diagnostics to find the best model.

The Models chosen by the above analysis:

  • ARIMA(2,1,1)x(1,1,1)[12]
  • ARIMA(3,1,2)x(2,0,0)[12]
  • ARIMA(3,1,2)x(2,1,1)[12]
  • ARIMA(0,1,2)x(0,1,1)[12]

ARIMA(2,1,1)x(1,1,1)[12]

###### Set Seed for Reproducibility ######
set.seed(345)

###### Capture ARIMA Model Output for Diagnostics ######

# Model diagnostics for ARIMA(2,1,1)x(1,1,1)[12]
model_output <- capture.output(sarima(ts_log_eggs, 2, 1, 1, 1, 1, 1, 12))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
     Estimate     SE  t.value p.value
ar1    0.1891 0.1973   0.9581  0.3385
ar2    0.1903 0.0461   4.1308  0.0000
ma1   -0.3120 0.1984  -1.5722  0.1165
sar1  -0.0472 0.0494  -0.9561  0.3395
sma1  -0.9449 0.0298 -31.7492  0.0000

sigma^2 estimated as 0.003918233 on 522 degrees of freedom 
 
AIC = -2.628597  AICc = -2.628378  BIC = -2.580014 
 

The Residual Plot shows nearly consistent fluctuation around zero, suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.

The Autocorrelation Function (ACF) of the residuals reveals no significant autocorrelations indicating a good model.

The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.

The Ljung-Box Test p-values remain mostly above the 0.05 significance level, implying a nearly good fit.

Coefficient Significance: here the coefficients for ar1, ma1, and sar1 are not significant, which calls into question whether this model is a good fit.

ARIMA(3,1,2)x(2,0,0)[12]

###### Load Necessary Libraries ######
library(astsa)

###### Set Seed for Reproducibility ######
set.seed(345)

###### Capture ARIMA Model Output for Diagnostics ######

# Model diagnostics for ARIMA(3,1,2)x(2,0,0)[12]
model_output <- capture.output(sarima(ts_log_eggs, 3, 1, 2, 2, 0, 0, 12))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
         Estimate     SE t.value p.value
ar1        1.3388 0.2060  6.4980  0.0000
ar2       -0.5177 0.1765 -2.9326  0.0035
ar3       -0.1483 0.0482 -3.0741  0.0022
ma1       -1.4196 0.2042 -6.9528  0.0000
ma2        0.7467 0.1519  4.9173  0.0000
sar1       0.1100 0.0529  2.0797  0.0380
sar2       0.1924 0.0483  3.9806  0.0001
constant   0.0029 0.0040  0.7198  0.4719

sigma^2 estimated as 0.004398357 on 531 degrees of freedom 
 
AIC = -2.552677  AICc = -2.552173  BIC = -2.481049 
 

The Residual Plot shows nearly consistent fluctuation around zero, suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.

The Autocorrelation Function (ACF) of the residuals reveals no significant autocorrelations indicating a good model.

The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.

The Ljung-Box Test p-values remain mostly above the 0.05 significance level, implying a nearly good fit.

Coefficient Significance: here only the constant term is not significant (p = 0.47); the AR, MA, and seasonal AR coefficients are all significant.

ARIMA(3,1,2)x(2,1,1)[12]

###### Set Seed for Reproducibility ######
set.seed(345)

###### Capture ARIMA Model Output for Diagnostics ######

# Model diagnostics for ARIMA(3,1,2)x(2,1,1)[12]
model_output <- capture.output(sarima(ts_log_eggs, 3, 1, 2, 2, 1, 1, 12))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
     Estimate     SE  t.value p.value
ar1    1.4825 0.1562   9.4897  0.0000
ar2   -0.3041 0.1712  -1.7763  0.0763
ar3   -0.2368 0.0438  -5.4086  0.0000
ma1   -1.6279 0.1594 -10.2147  0.0000
ma2    0.6625 0.1591   4.1649  0.0000
sar1   0.0087 0.0535   0.1626  0.8709
sar2   0.0168 0.0517   0.3254  0.7450
sma1  -0.9540 0.0333 -28.6119  0.0000

sigma^2 estimated as 0.003834145 on 519 degrees of freedom 
 
AIC = -2.635531  AICc = -2.635004  BIC = -2.562657 
 

ARIMA(0,1,2)x(0,1,1)[12]

###### Set Seed for Reproducibility ######
set.seed(345)

###### Capture ARIMA Model Output for Diagnostics ######

# Model diagnostics for ARIMA(0,1,2)x(0,1,1)[12]
model_output <- capture.output(sarima(ts_log_eggs, 0, 1, 2, 0, 1, 1, 12))

# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
     Estimate     SE  t.value p.value
ma1   -0.1281 0.0438  -2.9248  0.0036
ma2    0.1673 0.0422   3.9686  0.0001
sma1  -0.9572 0.0290 -32.9526  0.0000

sigma^2 estimated as 0.003917763 on 524 degrees of freedom 
 
AIC = -2.633045  AICc = -2.632958  BIC = -2.600656 
 

Comparing the diagnostics and AIC values across the models above, we choose ARIMA(3,1,2)x(2,1,1)[12].

Fit the model

###### Fit the Final ARIMA Model ######
# Based on diagnostics and AIC values, ARIMA(3,1,2)x(2,1,1)[12] was selected
fit <- Arima(ts_log_eggs, order = c(3, 1, 2), seasonal = c(2, 1, 1))

###### Display Model Summary ######
summary(fit)
Series: ts_log_eggs 
ARIMA(3,1,2)(2,1,1)[12] 

Coefficients:
         ar1      ar2      ar3      ma1     ma2    sar1    sar2     sma1
      1.4825  -0.3041  -0.2368  -1.6279  0.6625  0.0087  0.0168  -0.9540
s.e.  0.1562   0.1712   0.0438   0.1594  0.1591  0.0535  0.0517   0.0333

sigma^2 = 0.003893:  log likelihood = 703.46
AIC=-1388.92   AICc=-1388.58   BIC=-1350.52

Training set error measures:
                      ME       RMSE        MAE       MPE     MAPE      MASE
Training set 0.001965082 0.06117061 0.04498489 -12.92247 66.23099 0.2957145
                    ACF1
Training set 0.005535244

Forecasting

###### Forecast the Next 120 Periods ######
forecast_result <- forecast(fit, h = 120)

###### Display Forecast Accuracy ######
accuracy(forecast_result)
                      ME       RMSE        MAE       MPE     MAPE      MASE
Training set 0.001965082 0.06117061 0.04498489 -12.92247 66.23099 0.2957145
                    ACF1
Training set 0.005535244
###### Plot the Forecast ######
autoplot(forecast_result) +
  labs(title = "ARIMA(3,1,2)x(2,1,1)[12] Forecast",
       x = "Time",
       y = "Predicted Log Egg Prices") +
  theme_minimal()

Benchmark Methods

###### Forecast Comparison Plot ######
autoplot(ts_log_eggs) +
  autolayer(meanf(ts_log_eggs, h = 120), series = "Mean", PI = FALSE) +
  autolayer(naive(ts_log_eggs, h = 120), series = "Naïve", PI = FALSE) +
  autolayer(snaive(ts_log_eggs, h = 120), series = "SNaïve", PI = FALSE) +
  autolayer(rwf(ts_log_eggs, h = 120, drift = TRUE), series = "Drift", PI = FALSE) +
  autolayer(forecast(fit, h = 120), series = "SARIMA Fit", PI = FALSE) +
  guides(colour = guide_legend(title = "Forecast Methods")) +
  ggtitle("Forecast Comparison for Log Transformed Egg Prices") +
  ylab("Log Egg Prices") +
  xlab("Time")

Based on the benchmark methods it is clear that the SARIMA model was able to catch the trend in the egg prices.

Original vs. Fitted Values

For the log transformed data

###### Forecast the Next 365 Periods ######
forecast_result <- forecast(fit, h = 365)

###### Display Forecast Accuracy ######
accuracy(forecast_result)
                      ME       RMSE        MAE       MPE     MAPE      MASE
Training set 0.001965082 0.06117061 0.04498489 -12.92247 66.23099 0.2957145
                    ACF1
Training set 0.005535244
###### Plot Original Data vs Fitted Values ######
autoplot(ts_log_eggs, series = "Original Data") +
  autolayer(forecast_result$fitted, series = "Fitted Values", PI = FALSE) +
  labs(title = "Original Data vs Fitted Values",
       x = "Time",
       y = "Log Egg Prices") +
  theme_minimal() +
  guides(colour = guide_legend(title = "Series"))

For the original data

###### Forecast the Next 365 Periods ######
forecast_result <- forecast(fit, h = 365)

###### Display Forecast Accuracy ######
accuracy(forecast_result)
                      ME       RMSE        MAE       MPE     MAPE      MASE
Training set 0.001965082 0.06117061 0.04498489 -12.92247 66.23099 0.2957145
                    ACF1
Training set 0.005535244
###### Plot Original Egg Prices vs Fitted Values ######
# Ensure Date is in Date format
eggs_data$Date <- as.Date(eggs_data$Date)

# Convert to time series format
start_year <- as.numeric(format(min(eggs_data$Date), "%Y"))
start_month <- as.numeric(format(min(eggs_data$Date), "%m"))

egg_prices_ts <- ts(eggs_data$egg_prices, start = c(start_year, start_month), frequency = 12)
fitted_ts <- ts(exp(forecast_result$fitted), start = c(start_year, start_month), frequency = 12)

# Plot
autoplot(egg_prices_ts, series = "Original Data") +
  autolayer(fitted_ts, series = "Fitted Values", PI = FALSE) +
  labs(title = "Original Egg Prices vs Fitted Values",
       x = "Time",
       y = "Egg Prices") +
  theme_minimal() +
  guides(colour = guide_legend(title = "Series"))

The fitted values track the original prices very closely.
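One caveat when back-transforming with `exp()`, as done for the fitted values above: the naive exponential of a log-scale forecast targets the median, not the mean, of the original series. Under normal errors, a lognormal bias correction multiplies by \(\exp(\sigma^2/2)\). A base-R sketch with illustrative values, not the egg model's:

```r
# Log-scale point forecast and its error variance (illustrative values)
log_fc <- 1.5     # forecast of log(price)
sigma2 <- 0.04    # variance of the log-scale forecast error

naive_fc     <- exp(log_fc)               # median-unbiased back-transform
corrected_fc <- exp(log_fc + sigma2 / 2)  # mean-unbiased (lognormal) version

c(naive_fc, corrected_fc)   # the corrected forecast is slightly larger
```

For small forecast variances the two versions are nearly identical, which is why the simple `exp()` back-transform works well here.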

Prophet model

Remember that the Prophet model works best when there is strong seasonality.

###### Load Required Libraries ######
library(prophet)
library(dplyr)

###### Prepare the Dataset for Prophet ######
# Convert the log-transformed egg prices data for Prophet
eggs_data_prophet <- eggs_data %>% 
  select(Date, log_egg_prices) %>% 
  rename(ds = Date, y = log_egg_prices)

###### Initialize and Fit the Prophet Model ######
prophet_model <- prophet(eggs_data_prophet)

###### Extend the Forecast Horizon by 365 Days ######
future_dates <- make_future_dataframe(prophet_model, periods = 365)

###### Generate Future Predictions ######
predictions <- predict(prophet_model, future_dates)

###### Visualize the Forecast ######
plot(prophet_model, predictions) +
  labs(title = "Prophet Forecast for Log Egg Prices",
       x = "Date",
       y = "Log Egg Prices")

Example 2:

US monthly natural gas consumption: 2000 - 2019. Units: Billion Cubic Feet

  1. TS plot and the ACF
###### Plot US Monthly Natural Gas Consumption ######
autoplot(USgas) +
  ggtitle("US Monthly Natural Gas Consumption") +
  xlab("Year") + 
  ylab("Gas Consumption") +
  theme_minimal()

###### ACF Plot Using Base R ######
acf(USgas, main = "ACF of US Monthly Natural Gas Consumption")

###### ACF Plot Using ggAcf ######
ggAcf(USgas, lag.max = 40) +
  ggtitle("ACF of US Monthly Natural Gas Consumption (40 Lags)") 

You can see the difference in the ACF plot when using the ggAcf() and acf() functions: ggAcf() places the seasonal lags at \(12, 24, 36, \ldots\), whereas acf() labels the seasonal lags \(1, 2, \ldots\) (in cycles) and uses decimal values for the ordinary lags.
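The lag-unit difference can be checked directly: for a monthly ts object, acf() reports lags in cycles, so ordinary lags appear as twelfths and the first seasonal lag appears at 1. A base-R check on the built-in AirPassengers series, not USgas:

```r
# acf() on a frequency-12 series reports lags in *cycles*, not observations
a <- acf(AirPassengers, lag.max = 24, plot = FALSE)

a$lag[2]    # first ordinary lag appears as 1/12
a$lag[13]   # lag 12 (one full season) appears as 1
```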

  2. Lag Plot
########## Interactive plots#########
ts_lags(USgas,lags=c(12, 24, 36, 48)) 
######################################
gglagplot(USgas, do.lines=FALSE, set.lags = c(12, 24, 36, 48))+ggtitle("US monthly natural gas consumption") 

Apparent seasonal correlation can be seen by the positive correlation at seasonal lags 12,24,36,48..

  3. Decomposing

We are using decompose() function to look at the seasonal component.

dec2 <- decompose(USgas, type = "additive")  # type takes a single value: "additive" or "multiplicative"
plot(dec2)

  4. Differencing
str(USgas) # this is already time series
 Time-Series [1:238] from 2000 to 2020: 2510 2331 2051 1783 1633 ...
USgas %>% diff() %>% ggtsdisplay() #first ordinary differencing

USgas %>% diff(lag=12) %>% ggtsdisplay() #first seasonal differencing

USgas %>% diff(lag=12) %>% diff() %>% ggtsdisplay() #do both

You can see that the first ordinary differencing is not enough because we can clearly see the seasonal correlation. Therefore, we need to proceed with the seasonal differencing.

Second set of plots: only seasonal differencing. It looks like there isn't much correlation left after the seasonal differencing alone. Therefore it could be D=1 and d=0; let's keep this as one option.

Third set of plots: applying both seasonal and ordinary differencing. This looks more stationary.

Here: by ACF Plot: q=0,1,2,3; Q=1,2 and PACF plot: p=0,1,2; P=1,2

I include 0 in the candidate ranges because, with more parameters, some can become insignificant in the presence of the others.

  5. Model Selection
######################## Check for different combinations ########


# Write a function to loop over candidate SARIMA orders
SARIMA.c=function(p1,p2,q1,q2,P1,P2,Q1,Q2,data){
  
  #K=(p2+1)*(q2+1)*(P2+1)*(Q2+1)
  
  temp=c()
  d=1
  D=1
  s=12
  
  i=1
  temp= data.frame()
  ls=matrix(rep(NA,9*35),nrow=35)
  
  
  for (p in p1:p2)
  {
    for(q in q1:q2)
    {
      for(P in P1:P2)
      {
        for(Q in Q1:Q2)
        {
          if(p+d+q+P+D+Q<=9)
          {
            
            model<- Arima(data,order=c(p-1,d,q-1),seasonal=c(P-1,D,Q-1))
            ls[i,]= c(p-1,d,q-1,P-1,D,Q-1,model$aic,model$bic,model$aicc)
            i=i+1
            #print(i)
            
          }
          
        }
      }
    }
    
  }
  
  
  temp= as.data.frame(ls)
  names(temp)= c("p","d","q","P","D","Q","AIC","BIC","AICc")
  
  temp
  
}
# q=0,1,2,3; Q=1,2 and PACF plot: p=0,1,2; P=1,2, D=1 and d=0,1
output=SARIMA.c(p1=1,p2=3,q1=1,q2=4,P1=1,P2=3,Q1=1,Q2=3,data=USgas)
#output

knitr::kable(output)
p d q P D Q AIC BIC AICc
0 1 0 0 1 0 2892.631 2896.047 2892.649
0 1 0 0 1 1 2805.310 2812.142 2805.364
0 1 0 0 1 2 2805.986 2816.234 2806.094
0 1 0 1 1 0 2866.355 2873.187 2866.409
0 1 0 1 1 1 2806.594 2816.842 2806.702
0 1 0 1 1 2 2805.556 2819.221 2805.738
0 1 0 2 1 0 2828.098 2838.346 2828.206
0 1 0 2 1 1 2798.563 2812.227 2798.745
0 1 1 0 1 0 2854.973 2861.805 2855.027
0 1 1 0 1 1 2769.757 2780.006 2769.866
0 1 1 0 1 2 2769.439 2783.103 2769.621
0 1 1 1 1 0 2831.546 2841.794 2831.655
0 1 1 1 1 1 2770.495 2784.159 2770.677
0 1 1 2 1 0 2797.594 2811.259 2797.776
0 1 2 0 1 0 2831.956 2842.204 2832.064
0 1 2 0 1 1 2754.034 2767.699 2754.216
0 1 2 1 1 0 2809.699 2823.364 2809.881
0 1 3 0 1 0 2833.595 2847.260 2833.777
1 1 0 0 1 0 2878.839 2885.672 2878.893
1 1 0 0 1 1 2789.262 2799.511 2789.371
1 1 0 0 1 2 2789.799 2803.464 2789.981
1 1 0 1 1 0 2852.613 2862.862 2852.722
1 1 0 1 1 1 2790.451 2804.116 2790.633
1 1 0 2 1 0 2816.768 2830.433 2816.950
1 1 1 0 1 0 2831.655 2841.903 2831.763
1 1 1 0 1 1 2752.973 2766.637 2753.155
1 1 1 1 1 0 2809.715 2823.379 2809.897
1 1 2 0 1 0 2833.110 2846.775 2833.292
2 1 0 0 1 0 2845.213 2855.462 2845.322
2 1 0 0 1 1 2768.783 2782.447 2768.964
2 1 0 1 1 0 2822.087 2835.751 2822.268
2 1 1 0 1 0 2846.528 2860.193 2846.710
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
output[which.min(output$AIC),] 
   p d q P D Q      AIC      BIC     AICc
26 1 1 1 0 1 1 2752.973 2766.637 2753.155
output[which.min(output$BIC),]
   p d q P D Q      AIC      BIC     AICc
26 1 1 1 0 1 1 2752.973 2766.637 2753.155
output[which.min(output$AICc),]
   p d q P D Q      AIC      BIC     AICc
26 1 1 1 0 1 1 2752.973 2766.637 2753.155

ARIMA(1,1,1)x(0,1,1)[12] is the model suggested above.

###### Set Seed for Reproducibility ######
set.seed(123)

###### Fit SARIMA Model and Capture Output ######
model_output <- capture.output(sarima(USgas, 1, 1, 1, 0, 1, 1, 12))

###### Find the Line Numbers Dynamically Based on a Keyword ######
start_line <- grep("Coefficients", model_output)  # Locate where coefficient details start
end_line <- length(model_output)  # Last line of output

###### Print the Relevant Section Automatically ######
cat(model_output[start_line:end_line], sep = "\n")
Coefficients: 
     Estimate     SE  t.value p.value
ar1    0.4112 0.0758   5.4229       0
ma1   -0.9072 0.0370 -24.4854       0
sma1  -0.8072 0.0484 -16.6757       0

sigma^2 estimated as 10925.36 on 222 degrees of freedom 
 
AIC = 12.23543  AICc = 12.23592  BIC = 12.29616 
 

The Standardized Residuals plot appears good, showing approximate stationarity with a nearly constant mean and variance.

The Autocorrelation Function (ACF) plot shows almost no correlation, indicating that the model has captured the structure and what remains is white noise. This indicates a good model fit.

The Quantile-Quantile (Q-Q) Plot still demonstrates near-normality.

The Ljung-Box test results show p-values below the 0.05 (5%) significance threshold, indicating that some significant correlation remains.

$ttable: all coefficients are significant.

  6. Model Fitting
fit <- Arima(USgas, order=c(1,1,1), seasonal=c(0,1,1))
summary(fit)
Series: USgas 
ARIMA(1,1,1)(0,1,1)[12] 

Coefficients:
         ar1      ma1     sma1
      0.4112  -0.9072  -0.8072
s.e.  0.0758   0.0370   0.0484

sigma^2 = 11073:  log likelihood = -1372.49
AIC=2752.97   AICc=2753.15   BIC=2766.64

Training set error measures:
                   ME     RMSE      MAE       MPE     MAPE      MASE
Training set 6.818919 101.6306 75.78126 0.1440502 3.564759 0.6519017
                     ACF1
Training set -0.003844906
  2. Forecasting
fit %>% forecast(h=36) %>% autoplot() #next 3 years

sarima.for(USgas, 36, 1,1,1,0,1,1,12)

$pred
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
2019                                                                        
2020 3445.352 3024.296 2910.105 2403.679 2254.036 2256.930 2475.676 2479.352
2021 3505.170 3089.649 2977.734 2472.244 2322.987 2326.038 2544.849 2548.552
2022 3574.389 3158.868 3046.953 2541.463 2392.206 2395.257 2614.068 2617.771
          Sep      Oct      Nov      Dec
2019                   2694.431 3158.320
2020 2285.920 2406.314 2708.053 3204.677
2021 2355.131 2475.530 2777.271 3273.895
2022 2424.350 2544.749                  

$se
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
2019                                                                        
2020 121.1862 123.2756 124.7305 125.9540 127.0832 128.1697 129.2337 130.2838
2021 141.6944 143.2182 144.6274 145.9839 147.3121 148.6220 149.9179 151.2016
2022 163.6161 165.3688 167.0047 168.5853 170.1355 171.6652 173.1787 174.6781
          Sep      Oct      Nov      Dec
2019                   104.5300 117.0569
2020 131.3232 132.3536 137.3318 139.8920
2021 152.4741 153.7359 158.8085 161.5838
2022 176.1643 177.6380                  
  3. Compare with Benchmark methods
fit <- Arima(USgas, order=c(1,1,1), seasonal=c(0,1,1))

autoplot(USgas) +
  autolayer(meanf(USgas, h=36),
            series="Mean", PI=FALSE) +
  autolayer(naive(USgas, h=36),
            series="Naïve", PI=FALSE) +
  autolayer(snaive(USgas, h=36),
            series="SNaïve", PI=FALSE)+
  autolayer(rwf(USgas, h=36, drift=TRUE),
            series="Drift", PI=FALSE)+
  autolayer(forecast(fit,36), 
            series="fit",PI=FALSE) +
  guides(colour=guide_legend(title="Forecast"))

The fitted model looks a little better than the seasonal naïve forecast because it captures the trend.
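The seasonal naïve method simply repeats the last observed seasonal cycle, which is why it misses the trend. A minimal base-R sketch of the idea, using the built-in AirPassengers series rather than USgas (which comes from a package):

```r
# Seasonal naive forecast by hand: every forecast equals the value from
# the same month one year earlier, so the last 12 observations are
# recycled over the whole horizon.
x <- AirPassengers                     # built-in monthly series
last_cycle <- tail(as.numeric(x), 12)  # Jan-Dec 1960

h <- 36                                # 3-year horizon
fc <- rep(last_cycle, length.out = h)  # same 12 values, repeated

fc[1:12]  # identical to the observed 1960 values
```

Because the same cycle is recycled, any trend in the data is flattened out of the forecast, which is exactly what the fitted SARIMA model captures and snaive does not.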

f2 <- snaive(USgas, h=36) 

accuracy(f2)
                   ME    RMSE      MAE      MPE     MAPE MASE     ACF1
Training set 37.31549 149.597 116.2465 1.498393 5.479521    1 0.468312
summary(fit)
Series: USgas 
ARIMA(1,1,1)(0,1,1)[12] 

Coefficients:
         ar1      ma1     sma1
      0.4112  -0.9072  -0.8072
s.e.  0.0758   0.0370   0.0484

sigma^2 = 11073:  log likelihood = -1372.49
AIC=2752.97   AICc=2753.15   BIC=2766.64

Training set error measures:
                   ME     RMSE      MAE       MPE     MAPE      MASE
Training set 6.818919 101.6306 75.78126 0.1440502 3.564759 0.6519017
                     ACF1
Training set -0.003844906

The fitted model's error measures are much lower than those of the seasonal naïve benchmark method.

Therefore, our fitted model is good.
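The MASE column makes this comparison explicit: it is the model's MAE divided by the seasonal naïve method's in-sample MAE, so values below 1 mean the model beats the benchmark. Recomputing it from the numbers printed above:

```r
# MASE = model MAE / seasonal-naive training MAE (which is why the
# snaive benchmark itself always shows MASE = 1 on its training set)
mae_fit    <- 75.78126   # from summary(fit)
mae_snaive <- 116.2465   # from accuracy(f2)

mase <- mae_fit / mae_snaive
round(mase, 4)  # ~0.6519, matching the MASE column in summary(fit)
```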

Example 3: Monthly Airline Passenger Numbers 1949-1960

This follows the same procedure as we discussed before.

###### Load and Inspect AirPassengers Data ######
x <- AirPassengers
str(x)
 Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
###### Plot Original Data ######
autoplot(x) +
  ggtitle("AirPassengers Data") +
  xlab("Year") +
  ylab("Number of Passengers") +
  theme_minimal()

###### Apply Log Transformation and Differencing ######
lx <- log(x)
dlx <- diff(lx)
ddlx <- diff(dlx, lag = 12)

###### Plot Transformed Series ######
plot.ts(cbind(Original = x, Log_Transformed = lx, First_Diff = dlx, Seasonal_Diff = ddlx), 
        main = "AirPassengers Data Transformations")
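The two differencing steps compose to (1 − B)(1 − B^12) applied to the log series: each value of ddlx is lx[t] − lx[t−1] − lx[t−12] + lx[t−13]. A small check of this identity against a hand computation:

```r
# Verify that first + seasonal differencing equals the hand-computed
# combination lx[t] - lx[t-1] - lx[t-12] + lx[t-13].
lx   <- as.numeric(log(AirPassengers))  # 144 monthly observations
ddlx <- diff(diff(lx), lag = 12)        # (1 - B)(1 - B^12) applied to lx

n <- length(lx)
manual <- lx[14:n] - lx[13:(n - 1)] - lx[2:(n - 12)] + lx[1:(n - 13)]

all.equal(ddlx, manual)  # TRUE; 13 observations are lost (1 + 12)
```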

###### Seasonal Plots ######
par(mfrow = c(2, 1))
monthplot(dlx, main = "Month Plot of First Differenced Log Data")
monthplot(ddlx, main = "Month Plot of Seasonal Differenced Data")

###### ACF and PACF Plots ######
acf2(ddlx, 50)

      [,1]  [,2]  [,3]  [,4] [,5] [,6]  [,7]  [,8] [,9] [,10] [,11] [,12] [,13]
ACF  -0.34  0.11 -0.20  0.02 0.06 0.03 -0.06  0.00 0.18 -0.08  0.06 -0.39  0.15
PACF -0.34 -0.01 -0.19 -0.13 0.03 0.03 -0.06 -0.02 0.23  0.04  0.05 -0.34 -0.11
     [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
ACF  -0.06  0.15 -0.14  0.07  0.02 -0.01 -0.12  0.04 -0.09  0.22 -0.02  -0.1
PACF -0.08 -0.02 -0.14  0.03  0.11 -0.01 -0.17  0.13 -0.07  0.14 -0.07  -0.1
     [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
ACF   0.05 -0.03  0.05 -0.02 -0.05 -0.05  0.20 -0.12  0.08 -0.15 -0.01  0.05
PACF -0.01  0.04 -0.09  0.05  0.00 -0.10 -0.02  0.01 -0.02  0.02 -0.16 -0.03
     [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
ACF   0.03 -0.02 -0.03 -0.07  0.10 -0.09  0.03 -0.04 -0.04  0.11 -0.05  0.11
PACF  0.01  0.05 -0.08 -0.17  0.07 -0.10 -0.06 -0.03 -0.12 -0.01 -0.05  0.09
     [,50]
ACF  -0.02
PACF  0.13
log(x) %>% diff() %>% ggtsdisplay()

log(x) %>% diff(12) %>% ggtsdisplay()

log(x) %>% diff() %>% diff(12) %>% ggtsdisplay() # this is better

# Candidate orders from the ACF/PACF: p = 1,3; q = 1,3; P = 1; Q = 1; d = 0,1; D = 1
output=SARIMA.c(p1=1,p2=4,q1=1,q2=4,P1=1,P2=2,Q1=1,Q2=2,data=log(x))
#output

#knitr::kable(output)
output[which.min(output$AIC),] 
  p d q P D Q       AIC       BIC      AICc
6 0 1 1 0 1 1 -483.3991 -474.7735 -483.2101
output[which.min(output$BIC),]
  p d q P D Q       AIC       BIC      AICc
6 0 1 1 0 1 1 -483.3991 -474.7735 -483.2101
output[which.min(output$AICc),]
  p d q P D Q       AIC       BIC      AICc
6 0 1 1 0 1 1 -483.3991 -474.7735 -483.2101
###### Fit SARIMA Models and Display Diagnostics ######

# ARIMA(1,1,1)x(0,1,1)[12]
model_output11 <- capture.output(sarima(lx, 1, 1, 1, 0, 1, 1, 12))

cat(model_output11[grep("Coefficients", model_output11):length(model_output11)], sep = "\n")
Coefficients: 
     Estimate     SE t.value p.value
ar1    0.1960 0.2475  0.7921  0.4298
ma1   -0.5784 0.2132 -2.7127  0.0076
sma1  -0.5643 0.0747 -7.5544  0.0000

sigma^2 estimated as 0.001341097 on 128 degrees of freedom 
 
AIC = -3.678622  AICc = -3.677179  BIC = -3.59083 
 
# ARIMA(0,1,1)x(0,1,1)[12] - Best Model
model_output12 <- capture.output(sarima(lx, 0, 1, 1, 0, 1, 1, 12))

cat(model_output12[grep("Coefficients", model_output12):length(model_output12)], sep = "\n")
Coefficients: 
     Estimate     SE t.value p.value
ma1   -0.4018 0.0896 -4.4825       0
sma1  -0.5569 0.0731 -7.6190       0

sigma^2 estimated as 0.001348035 on 129 degrees of freedom 
 
AIC = -3.690069  AICc = -3.689354  BIC = -3.624225 
 
# ARIMA(1,1,0)x(0,1,1)[12]
model_output13 <- capture.output(sarima(lx, 1, 1, 0, 0, 1, 1, 12))

cat(model_output13[grep("Coefficients", model_output13):length(model_output13)], sep = "\n")
Coefficients: 
     Estimate     SE t.value p.value
ar1   -0.3395 0.0822 -4.1295   1e-04
sma1  -0.5619 0.0748 -7.5109   0e+00

sigma^2 estimated as 0.00136738 on 129 degrees of freedom 
 
AIC = -3.675493  AICc = -3.674777  BIC = -3.609649 
 

Forecasting

#chosen model : forecasting
sarima.for(lx, 12, 0,1,1, 0,1,1,12)

$pred
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
1961 6.110186 6.053775 6.171715 6.199300 6.232556 6.368779 6.507294 6.502906
          Sep      Oct      Nov      Dec
1961 6.324698 6.209008 6.063487 6.168025

$se
            Jan        Feb        Mar        Apr        May        Jun
1961 0.03671562 0.04278291 0.04809072 0.05286830 0.05724856 0.06131670
            Jul        Aug        Sep        Oct        Nov        Dec
1961 0.06513124 0.06873441 0.07215787 0.07542612 0.07855851 0.08157070
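Because the model was fitted to the log series, $pred and $se are on the log scale. Point forecasts and approximate 95% intervals on the original passenger scale are obtained by exponentiating, as this sketch with the January 1961 values printed above shows:

```r
# Back-transform a log-scale forecast to the original scale using the
# Jan 1961 $pred and $se values from sarima.for() above.
pred <- 6.110186
se   <- 0.03671562

point <- exp(pred)               # point forecast on the original scale
lo95  <- exp(pred - 1.96 * se)   # approximate 95% interval bounds
hi95  <- exp(pred + 1.96 * se)

c(lower = lo95, point = point, upper = hi95)
```

Note that exponentiating the point forecast gives the median, not the mean, of the forecast distribution on the original scale.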

Benchmark: Seasonal Naive

fit=Arima(lx,order=c(0,1,1),seasonal=c(0,1,1))
autoplot(lx) +
  autolayer(meanf(lx, h=12),
            series="Mean", PI=FALSE) +
  autolayer(naive(lx, h=12),
            series="Naïve", PI=FALSE) +
  autolayer(snaive(lx, h=12),
            series="SNaïve", PI=FALSE)+
  autolayer(rwf(lx, drift=TRUE, h=12),
            series="Drift", PI=FALSE)+
  autolayer(forecast(fit,12), 
            series="fit",PI=FALSE) +
  guides(colour=guide_legend(title="Forecast"))

###### Apply Seasonal Naive Forecast to Log-Transformed AirPassengers Data ######
f2 <- snaive(lx, h = 36)

###### Check Residuals of the Seasonal Naive Model ######
checkresiduals(f2)


    Ljung-Box test

data:  Residuals from Seasonal naive method
Q* = 267.65, df = 24, p-value < 2.2e-16

Model df: 0.   Total lags used: 24
###### Evaluate Forecast Accuracy ######
accuracy(f2)
                    ME     RMSE      MAE      MPE     MAPE MASE      ACF1
Training set 0.1198215 0.134642 0.121042 2.159718 2.182627    1 0.7136876
###### Display Model Summary ######
summary(f2)

Forecast method: Seasonal naive method

Model Information:
Call: snaive(y = lx, h = 36) 

Residual sd: 0.1346 

Error measures:
                    ME     RMSE      MAE      MPE     MAPE MASE      ACF1
Training set 0.1198215 0.134642 0.121042 2.159718 2.182627    1 0.7136876

Forecasts:
         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
Jan 1961       6.033086 5.860536 6.205637 5.769193 6.296980
Feb 1961       5.968708 5.796157 6.141258 5.704814 6.232601
Mar 1961       6.037871 5.865320 6.210422 5.773977 6.301764
Apr 1961       6.133398 5.960847 6.305949 5.869505 6.397291
May 1961       6.156979 5.984428 6.329530 5.893086 6.420872
Jun 1961       6.282267 6.109716 6.454817 6.018373 6.546160
Jul 1961       6.432940 6.260389 6.605491 6.169047 6.696834
Aug 1961       6.406880 6.234329 6.579431 6.142987 6.670773
Sep 1961       6.230481 6.057931 6.403032 5.966588 6.494375
Oct 1961       6.133398 5.960847 6.305949 5.869505 6.397291
Nov 1961       5.966147 5.793596 6.138697 5.702253 6.230040
Dec 1961       6.068426 5.895875 6.240976 5.804532 6.332319
Jan 1962       6.033086 5.789063 6.277110 5.659885 6.406288
Feb 1962       5.968708 5.724684 6.212731 5.595506 6.341909
Mar 1962       6.037871 5.793847 6.281894 5.664669 6.411073
Apr 1962       6.133398 5.889375 6.377422 5.760196 6.506600
May 1962       6.156979 5.912956 6.401002 5.783777 6.530181
Jun 1962       6.282267 6.038243 6.526290 5.909065 6.655468
Jul 1962       6.432940 6.188917 6.676964 6.059738 6.806142
Aug 1962       6.406880 6.162857 6.650903 6.033678 6.780082
Sep 1962       6.230481 5.986458 6.474505 5.857280 6.603683
Oct 1962       6.133398 5.889375 6.377422 5.760196 6.506600
Nov 1962       5.966147 5.722123 6.210170 5.592945 6.339348
Dec 1962       6.068426 5.824402 6.312449 5.695224 6.441627
Jan 1963       6.033086 5.734220 6.331953 5.576009 6.490163
Feb 1963       5.968708 5.669841 6.267574 5.511631 6.425784
Mar 1963       6.037871 5.739004 6.336737 5.580794 6.494948
Apr 1963       6.133398 5.834532 6.432265 5.676321 6.590475
May 1963       6.156979 5.858112 6.455845 5.699902 6.614056
Jun 1963       6.282267 5.983400 6.581133 5.825190 6.739344
Jul 1963       6.432940 6.134074 6.731807 5.975863 6.890017
Aug 1963       6.406880 6.108013 6.705746 5.949803 6.863957
Sep 1963       6.230481 5.931615 6.529348 5.773405 6.687558
Oct 1963       6.133398 5.834532 6.432265 5.676321 6.590475
Nov 1963       5.966147 5.667280 6.265013 5.509070 6.423224
Dec 1963       6.068426 5.769559 6.367292 5.611349 6.525502

Example 4: Corticosteroid drug sales in Australia

We will try to forecast monthly corticosteroid drug sales in Australia. These are known as H02 drugs under the Anatomical Therapeutic Chemical classification scheme.

###### Plot Original H02 Sales Data ######
autoplot(h02) +
  ggtitle("H02 Sales Data") +
  xlab("Year") +
  ylab("Sales (Million Scripts)") +
  theme_minimal()

###### Log-Transform H02 Sales Data ######
lh02 <- log(h02)

###### Plot Original and Log-Transformed H02 Sales Data ######
cbind("H02 Sales (Million Scripts)" = h02,
      "Log H02 Sales" = lh02) %>%
  autoplot(facets = TRUE) +
  ggtitle("Original and Log-Transformed H02 Sales") +
  xlab("Year") +
  ylab("") +
  theme_minimal()

Data from July 1991 to June 2008 are plotted. There is a small increase in the variance with the level, so we take logarithms to stabilise the variance.

The data are strongly seasonal and obviously non-stationary, so seasonal differencing will be used. The seasonally differenced data are shown below. It is not clear at this point whether we should do another difference or not.

The last few observations appear to be different (more variable) from the earlier data. This may be due to the fact that data are sometimes revised when earlier sales are reported late.

###### Plot ACF of Log-Transformed H02 Sales Data ######
ggAcf(lh02) +
  ggtitle("ACF of Log-Transformed H02 Sales")

We can see high correlation and seasonal correlation. Therefore, let’s try seasonal differencing.

###### Apply Seasonal Differencing (D=1) to Log-Transformed H02 Sales Data ######

# Seasonal differencing using pipe
lh02 %>% 
  diff(lag = 12) %>% 
  ggtsdisplay(xlab = "Year", 
              main = "Seasonally Differenced H02 Scripts")

# Alternative approach using explicit variable
sdf <- diff(lh02, lag = 12)

###### Plot ACF and PACF for Seasonally Differenced Data ######
ggAcf(sdf) +
  ggtitle("ACF of Seasonally Differenced H02 Sales") 

ggPacf(sdf) +
  ggtitle("PACF of Seasonally Differenced H02 Sales") 

From the ACF plot above, you can see that the seasonal correlation has been removed, but some ordinary (non-seasonal) correlation remains.

Let’s do an ordinary differencing on top of seasonal differencing.

lh02 %>% diff(lag=12) %>%diff() %>%
  ggtsdisplay(xlab="Year",
    main="Seasonally & Ordinary differenced H02 scripts")

In the plots of the differenced data, there are spikes in the PACF at lag 12 and, marginally, at lag 24, and a spike at the seasonal lag 12 in the ACF. The pattern in the ACF is not indicative of any simple model.

So q = 0,1; p = 1,2; Q = 0,1,2; P = 0,1; d = 1; D = 1; s = 12.

We fit these models and compare their AIC, BIC, and AICc values.

# Write a function to fit all SARIMA(p,1,q)x(P,1,Q)[12] models in the
# given ranges and record their AIC, BIC, and AICc values.
SARIMA.c=function(p1,p2,q1,q2,P1,P2,Q1,Q2,data){
  
  d=1
  D=1
  s=12
  
  i=1
  ls=matrix(rep(NA,9*24),nrow=24)  # pre-allocate up to 24 result rows
  
  for (p in p1:p2)
  {
    for(q in q1:q2)
    {
      for(P in P1:P2)
      {
        for(Q in Q1:Q2)
        {
          if(p+d+q+P+D+Q<=10)  # cap the total model order at 10
          {
            model<- Arima(data,order=c(p,d,q),seasonal=c(P,D,Q))
            ls[i,]= c(p,d,q,P,D,Q,model$aic,model$bic,model$aicc)
            i=i+1
          }
        }
      }
    }
  }
  
  temp= as.data.frame(ls)
  names(temp)= c("p","d","q","P","D","Q","AIC","BIC","AICc")
  
  temp
}

output=SARIMA.c(p1=1,p2=2,q1=0,q2=1,P1=0,P2=1,Q1=0,Q2=2,data=lh02)
#SARIMA.c(2,3,1,2,1,2,1,3,data=lh02)

minaic=output[which.min(output$AIC),]
minbic=output[which.min(output$BIC),]

minaic
   p d q P D Q       AIC      BIC      AICc
21 2 1 1 0 1 2 -484.5096 -464.996 -484.0531
minbic
   p d q P D Q       AIC       BIC      AICc
14 2 1 0 0 1 1 -482.7753 -469.7662 -482.5602

The models suggested are ARIMA(2,1,1)x(0,1,2)[12] and ARIMA(2,1,0)x(0,1,1)[12].
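The same grid search can be sketched with base R's arima() alone, with no dependence on the forecast package. Here log(AirPassengers) stands in for lh02 (h02 ships with a package), and only a small illustrative grid is searched:

```r
# Minimal grid search over SARIMA orders using base R's arima().
# log(AirPassengers) is a stand-in series; the grid is deliberately
# small for illustration.
x <- log(AirPassengers)
grid <- expand.grid(p = 0:1, q = 0:1, P = 0, Q = 1)

rows <- lapply(seq_len(nrow(grid)), function(i) {
  g <- grid[i, ]
  fit <- tryCatch(
    arima(x, order = c(g$p, 1, g$q),
          seasonal = list(order = c(g$P, 1, g$Q), period = 12)),
    error = function(e) NULL)                 # skip models that fail
  if (is.null(fit)) return(NULL)
  data.frame(p = g$p, q = g$q, P = g$P, Q = g$Q, AIC = AIC(fit))
})
res <- do.call(rbind, rows)

res[which.min(res$AIC), ]  # best model in the grid by AIC
```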

Another way of doing this:

d=1
D=1
s=12

ARIMA_fit <- list()

## set counter
cc <- 1

for (p in 2:3)        # actual p = p - 1, i.e. p = 1, 2
{
  for (q in 1:2)      # actual q = q - 1, i.e. q = 0, 1
  {
    for (P in 1:2)    # actual P = P - 1, i.e. P = 0, 1
    {
      for (Q in 1:3)  # actual Q = Q - 1, i.e. Q = 0, 1, 2
      {
        if (p-1 + d + q-1 + P-1 + D + Q-1 <= 10)
        {
          ARIMA_fit[[cc]] <- Arima(lh02, order = c(p-1, d, q-1),
                                   seasonal = c(P-1, D, Q-1))
          cc <- cc + 1
        }
      }
    }
  }
}


# get AIC values for model evaluation
ARIMA_AIC <- sapply( ARIMA_fit, function(x) AIC(x)) 
## model with lowest AIC is the best
ARIMA_fit[[which(ARIMA_AIC == min(ARIMA_AIC))]]
Series: lh02 
ARIMA(2,1,1)(0,1,2)[12] 

Coefficients:
          ar1      ar2     ma1     sma1     sma2
      -1.1358  -0.5753  0.3683  -0.5318  -0.1817
s.e.   0.1608   0.0965  0.1884   0.0838   0.0881

sigma^2 = 0.004278:  log likelihood = 248.25
AIC=-484.51   AICc=-484.05   BIC=-465
ARIMA_BIC <- sapply(ARIMA_fit , function(x) BIC(x)) 

## model with lowest BIC is the best 
ARIMA_fit[[which(ARIMA_BIC == min(ARIMA_BIC))]]
Series: lh02 
ARIMA(2,1,0)(0,1,1)[12] 

Coefficients:
          ar1      ar2     sma1
      -0.8491  -0.4207  -0.6401
s.e.   0.0712   0.0714   0.0694

sigma^2 = 0.004387:  log likelihood = 245.39
AIC=-482.78   AICc=-482.56   BIC=-469.77
#ARIMA_fit[[which(ARIMA_AIC == min(ARIMA_AIC)) & which(ARIMA_BIC == min(ARIMA_BIC)) ]]
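The printed AIC and BIC can be reproduced from the log likelihood via AIC = −2·logLik + 2k and BIC = −2·logLik + k·log(n). Here k = 4 (ar1, ar2, sma1, plus σ²), and n is taken as the effective sample size after differencing, 204 − 1 − 12 = 191 (an assumption based on the July 1991 to June 2008 span of h02 noted earlier):

```r
# Reproduce the AIC/BIC printed for ARIMA(2,1,0)(0,1,1)[12] on lh02.
# k = 4 estimated parameters (ar1, ar2, sma1, sigma^2); n = 191 is the
# assumed effective sample size after first and seasonal differencing.
loglik <- 245.39
k <- 4
n <- 191

aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + k * log(n)

round(c(AIC = aic, BIC = bic), 2)  # compare with -482.78 and -469.77
```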

The models suggested are ARIMA(2,1,1)x(0,1,2)[12] and ARIMA(2,1,0)x(0,1,1)[12].

(fit <- Arima(h02, order=c(2,1,1), seasonal=c(0,1,2),lambda=0))
Series: h02 
ARIMA(2,1,1)(0,1,2)[12] 
Box Cox transformation: lambda= 0 

Coefficients:
          ar1      ar2     ma1     sma1     sma2
      -1.1358  -0.5753  0.3683  -0.5318  -0.1817
s.e.   0.1608   0.0965  0.1884   0.0838   0.0881

sigma^2 = 0.004278:  log likelihood = 248.25
AIC=-484.51   AICc=-484.05   BIC=-465
checkresiduals(fit, lag=36)


    Ljung-Box test

data:  Residuals from ARIMA(2,1,1)(0,1,2)[12]
Q* = 51.096, df = 31, p-value = 0.01298

Model df: 5.   Total lags used: 36
#or

(fit2 <- Arima(lh02, order=c(2,1,0), seasonal=c(0,1,1)))
Series: lh02 
ARIMA(2,1,0)(0,1,1)[12] 

Coefficients:
          ar1      ar2     sma1
      -0.8491  -0.4207  -0.6401
s.e.   0.0712   0.0714   0.0694

sigma^2 = 0.004387:  log likelihood = 245.39
AIC=-482.78   AICc=-482.56   BIC=-469.77
checkresiduals(fit2, lag=36)


    Ljung-Box test

data:  Residuals from ARIMA(2,1,0)(0,1,1)[12]
Q* = 59.282, df = 33, p-value = 0.003322

Model df: 3.   Total lags used: 36
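checkresiduals() adjusts the Ljung-Box degrees of freedom for the number of fitted coefficients: df = total lags − model df (here 36 − 3 = 33). Base R's Box.test() exposes the same adjustment through its fitdf argument; a sketch on simulated stand-in residuals:

```r
# fitdf subtracts the model's coefficient count from the Ljung-Box
# degrees of freedom, mirroring checkresiduals() (36 lags - 3 = 33 df).
set.seed(1)
resid_sim <- rnorm(192)  # simulated stand-in residuals

lb <- Box.test(resid_sim, lag = 36, fitdf = 3, type = "Ljung-Box")
unname(lb$parameter)  # 33 degrees of freedom
```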

In the FPP2 textbook (https://otexts.com/fpp2/seasonal-arima.html), the best model found is ARIMA(3,0,1)(0,1,2)[12].

(fit <- Arima(h02, order=c(3,0,1), seasonal=c(0,1,2),lambda=0))
Series: h02 
ARIMA(3,0,1)(0,1,2)[12] 
Box Cox transformation: lambda= 0 

Coefficients:
          ar1     ar2     ar3     ma1     sma1     sma2
      -0.1603  0.5481  0.5678  0.3827  -0.5222  -0.1768
s.e.   0.1636  0.0878  0.0942  0.1895   0.0861   0.0872

sigma^2 = 0.004278:  log likelihood = 250.04
AIC=-486.08   AICc=-485.48   BIC=-463.28
checkresiduals(fit, lag=36)


    Ljung-Box test

data:  Residuals from ARIMA(3,0,1)(0,1,2)[12]
Q* = 50.712, df = 30, p-value = 0.01045

Model df: 6.   Total lags used: 36

Model Diagnostic: There are a few significant spikes in the ACF, and the model fails the Ljung-Box test. The model can still be used for forecasting, but the prediction intervals may not be accurate due to the correlated residuals.

###### Capture SARIMA Model Outputs ######

# SARIMA(2,1,1)(0,1,2)[12]
model_output1 <- capture.output(sarima(lh02, 2, 1, 1, 0, 1, 2, 12))

cat(model_output1[grep("Coefficients", model_output1):length(model_output1)], sep = "\n")
Coefficients: 
     Estimate     SE t.value p.value
ar1   -1.1358 0.1608 -7.0622  0.0000
ar2   -0.5753 0.0965 -5.9630  0.0000
ma1    0.3683 0.1884  1.9546  0.0521
sma1  -0.5318 0.0838 -6.3462  0.0000
sma2  -0.1817 0.0881 -2.0622  0.0406

sigma^2 estimated as 0.004165779 on 186 degrees of freedom 
 
AIC = -2.5367  AICc = -2.535002  BIC = -2.434534 
 
# SARIMA(2,1,0)(0,1,1)[12]
model_output2 <- capture.output(sarima(lh02, 2, 1, 0, 0, 1, 1, 12))

cat(model_output2[grep("Coefficients", model_output2):length(model_output2)], sep = "\n")
Coefficients: 
     Estimate     SE  t.value p.value
ar1   -0.8491 0.0712 -11.9258       0
ar2   -0.4207 0.0714  -5.8926       0
sma1  -0.6401 0.0694  -9.2245       0

sigma^2 estimated as 0.004318266 on 188 degrees of freedom 
 
AIC = -2.527619  AICc = -2.526947  BIC = -2.459509 
 
# SARIMA(3,0,1)(0,1,2)[12]
model_output3 <- capture.output(sarima(lh02, 3, 0, 1, 0, 1, 2, 12))

cat(model_output3[grep("Coefficients", model_output3):length(model_output3)], sep = "\n")
Coefficients: 
         Estimate     SE t.value p.value
ar1       -0.2653 0.1691 -1.5687  0.1184
ar2        0.5011 0.0813  6.1665  0.0000
ar3        0.5394 0.0848  6.3626  0.0000
ma1        0.4572 0.1904  2.4013  0.0173
sma1      -0.5031 0.0847 -5.9385  0.0000
sma2      -0.2030 0.0871 -2.3305  0.0209
constant   0.0038 0.0009  4.4331  0.0000

sigma^2 estimated as 0.004023383 on 185 degrees of freedom 
 
AIC = -2.552016  AICc = -2.548846  BIC = -2.416287 
 

The Standard Residual Plots appear good in all 3 models, displaying a nearly constant mean and variance consistent with stationarity.

The Autocorrelation Function (ACF) plot shows nearly no correlation in all 3 models.

The Quantile-Quantile (Q-Q) Plots still demonstrate near-normality.

In the sarima() diagnostic plots, the Ljung-Box p-values lie above the 0.05 (5% significance) threshold, indicating no significant residual correlation at those lags and suggesting a well-fitted model. (Note, however, that checkresiduals() at 36 lags, shown earlier, still rejects for these models.)

$ttable:

sarima(lh02,2,1,1,0,1,2,12): all coefficients are significant except ma1; AIC = -484.51
sarima(lh02,2,1,0,0,1,1,12): all coefficients are significant; AIC = -482.78
sarima(lh02,3,0,1,0,1,2,12): all coefficients are significant except ar1 (p = 0.1184); AIC = -489.99

Next we will try using the automatic ARIMA algorithm.

auto.arima(lh02)
Series: lh02 
ARIMA(2,1,1)(0,1,2)[12] 

Coefficients:
          ar1      ar2     ma1     sma1     sma2
      -1.1358  -0.5753  0.3683  -0.5318  -0.1817
s.e.   0.1608   0.0965  0.1884   0.0838   0.0881

sigma^2 = 0.004278:  log likelihood = 248.25
AIC=-484.51   AICc=-484.05   BIC=-465

Running auto.arima() with all arguments left at their default values led to an ARIMA(2,1,1)(0,1,2)[12] model. However, as the checkresiduals() results below show, the model still fails the Ljung-Box test at 36 lags. Sometimes it is simply not possible to find a model that passes all of the tests.

(fit4 <- Arima(lh02, order=c(2,1,1), seasonal=c(0,1,2)))
Series: lh02 
ARIMA(2,1,1)(0,1,2)[12] 

Coefficients:
          ar1      ar2     ma1     sma1     sma2
      -1.1358  -0.5753  0.3683  -0.5318  -0.1817
s.e.   0.1608   0.0965  0.1884   0.0838   0.0881

sigma^2 = 0.004278:  log likelihood = 248.25
AIC=-484.51   AICc=-484.05   BIC=-465
checkresiduals(fit4, lag=36)


    Ljung-Box test

data:  Residuals from ARIMA(2,1,1)(0,1,2)[12]
Q* = 51.096, df = 31, p-value = 0.01298

Model df: 5.   Total lags used: 36

It looks, however, like sarima(lh02,3,0,1,0,1,2,12), i.e. ARIMA(3,0,1)(0,1,2)[12], is a better model.

fit<-h02 %>%
  Arima(order=c(3,0,1), seasonal=c(0,1,2), lambda=0)

autoplot(h02) +
  autolayer(meanf(h02, h=36),
            series="Mean", PI=FALSE) +
  autolayer(naive(h02, h=36),
            series="Naïve", PI=FALSE) +
  autolayer(snaive(h02, h=36),
            series="SNaïve", PI=FALSE)+
  autolayer(rwf(h02, h=36, drift=TRUE),
            series="Drift", PI=FALSE)+
  autolayer(forecast(fit,36), 
            series="fit",PI=FALSE) +
  guides(colour=guide_legend(title="Forecast"))

f2 <- snaive(h02, h=36) 

#checkresiduals(f2)

accuracy(f2)
                     ME       RMSE        MAE      MPE    MAPE MASE      ACF1
Training set 0.03265974 0.07415608 0.06061829 4.345164 8.02116    1 0.2700878
summary(fit)
Series: . 
ARIMA(3,0,1)(0,1,2)[12] 
Box Cox transformation: lambda= 0 

Coefficients:
          ar1     ar2     ar3     ma1     sma1     sma2
      -0.1603  0.5481  0.5678  0.3827  -0.5222  -0.1768
s.e.   0.1636  0.0878  0.0942  0.1895   0.0861   0.0872

sigma^2 = 0.004278:  log likelihood = 250.04
AIC=-486.08   AICc=-485.48   BIC=-463.28

Training set error measures:
                      ME      RMSE        MAE       MPE     MAPE      MASE
Training set 0.001736226 0.0493495 0.03596218 0.2515195 4.621055 0.5932563
                    ACF1
Training set -0.01964279

The accuracy measures are smaller for the fitted model than for the seasonal naïve benchmark, indicating a better model.

Forecasts from the ARIMA(3,0,1)(0,1,2)[12] model.

h02 %>%
  Arima(order=c(3,0,1), seasonal=c(0,1,2), lambda=0) %>%
  forecast() %>%
  autoplot() +
    ylab("H02 sales (million scripts)") + xlab("Year")